Methods and systems for using keywords preprocessing, Boyer-Moore analysis, and hybrids thereof, for processing regular expressions in intrusion-prevention systems

ABSTRACT

Methods and systems are provided for using keyword preprocessing, Boyer-Moore analysis, and hybrids thereof, in intrusion-prevention systems. In one embodiment, a state-transition table representative of a data pattern is provided. The table has a plurality of states, each having egress events that define transitions to other states. The data pattern is parsed to identify character strings. A subject is received for evaluation, and preprocessed to find any instances of those character strings. A keyword table is populated with the character strings found during preprocessing. While using the table to evaluate the subject, a first state having a first one of the character strings as an egress event is transitioned into. The keyword table is checked for the first character string, and, responsive to finding the first character string in the keyword table, a transition is taken from the first state to the second state.

RELATED APPLICATION

This application claims the benefit of U.S. Provisional Application No. 60/953,094 entitled “Methods and Systems for Using Keyword Preprocessing, Boyer-Moore Analysis, and Hybrids Thereof, for Processing Regular Expressions in Intrusion-Prevention Systems” filed Jul. 31, 2007, by Preston, et al.

BACKGROUND

1. Technical Field

The present invention relates to the processing of regular expressions, and, in particular, to improved state-diagram formations and state-transition processing.

2. Description of Related Art

Packet-data communication, such as communication over the Internet, is extremely popular, and is becoming more so every day. People and companies routinely use Internet-connected computers and networks to conduct their affairs. Myriad types of data are transmitted over the Internet, such as personal correspondence, medical information, financial information, business plans, etc. Unfortunately, not all uses of the Internet are benign; on the contrary, a significant percentage of the data that is transmitted over the Internet every day is malicious. Examples of this type of data are viruses, spyware, malware, worms, etc.

Not unexpectedly, an entire industry has developed to combat these vicious attempts to disrupt and harm Internet-based communications, along with the networks and computers used by those who engage in Internet-based communications. This industry, and the effort to fight these malicious threats generally, is often referred to as “intrusion prevention.” One important aspect of intrusion prevention involves identifying known threats (files that are or contain viruses, worms, spyware, malware, etc.) by particular data patterns contained therein. These data patterns are sometimes referred to as “signatures” of the security threats.

As such, data (e.g. IP) packets flowing through, towards, or from a particular router, switch, network, etc. are often screened, perhaps by an intermediate device, functional component, or other entity sometimes referred to as a “bump in the wire,” for the presence of these signature data patterns. When particular packets, or sequences of packets, are identified as containing at least one of these signatures, those packets (or, again, sequences of packets) may be “quarantined,” analogous to the way that people or animals having been identified as or suspected of carrying a particular disease would be, such that those packets cannot cause harm to any more networks and/or computers. These packets, removed from the normal flow of data traffic, can then be further analyzed without holding up that traffic generally.

It can thus be appreciated that it would be advantageous for a network device to be able to quickly and accurately identify these signature data patterns across one or more packets, and to do so in a way that uses relatively little in the way of computing resources such as processing time and memory.

These signature data patterns are often expressed using what is known as a “regular expression,” which is an instance of a system of notation that can, fairly elegantly, represent complicated data patterns that, by their presence, may indicate a potential security threat. As a small example, a regular expression such as “.+[a]{5}[b]” may be used to represent a data pattern that could be stated as “one or more of (+) any type of character (.), followed by five consecutive ‘a’ characters ([a]{5}), followed by a ‘b’ character ([b]).” Thus, the data being analyzed, often referred to as the “subject,” would have to contain that data pattern at least once to be considered to match that regular expression, which often is also referred to as a “regex.”
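As a minimal illustration, the example pattern can be tried with Python's standard `re` module. The helper name `contains_pattern` and the sample subjects are assumptions made for this sketch and are not part of the described system.

```python
import re

# The example pattern from the text: one or more of any character,
# then five consecutive 'a' characters, then a single 'b'.
PATTERN = re.compile(r".+[a]{5}[b]")


def contains_pattern(subject: str) -> bool:
    """Return True if the subject contains the pattern at least once."""
    return PATTERN.search(subject) is not None


if __name__ == "__main__":
    for subject in ("xaaaaab", "aaaaab", "xaaaab"):
        print(subject, contains_pattern(subject))
```

Note that “aaaaab” alone does not match, because the leading “.+” must consume at least one character before the five ‘a’ characters.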

The screening for these particular data patterns is often implemented using a state machine, which is typically generated from a particular regex. Basically, the characters (e.g. “@”) and character classes (e.g. “[0-9],” i.e. “any digit”) in the regex define the transitions between states in the machine. A particular data subject, perhaps the payload of one or more packets, would then be evaluated using the state machine; this evaluation essentially involves starting in an initial state, and using the characters in the subject to try to transition through the state machine. If the right sequence of characters is present in the subject to cause the processing of the state machine to arrive at what is known as a “match state,” then the subject is considered to match the regex that was used to generate the state machine, and the packet or packets that contained that subject may be quarantined for further analysis.

State machines are often also referred to as “automata,” and are of one of two types: deterministic finite automata (DFA) (a deterministic state machine, a.k.a. a DFA state machine) or nondeterministic finite automata (NFA) (a nondeterministic state machine, a.k.a. an NFA state machine). In general, a DFA will have no ambiguity; that is, from a given state, a given character in the subject will result in either zero or one valid transitions to a next state. In contrast, an NFA will have ambiguity in that, from a given state, a given character can, and often does, match more than one transition to a next state. Thus, in an NFA, multiple paths through the state machine may be valid for the same subject.
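The difference can be sketched with two small table-driven machines that both look for the string “ab” anywhere in a subject. The tables, state numbers, and function names below are illustrative assumptions, not structures taken from the described system. The DFA tracks a single current state, while the NFA simulation must track every state that is currently valid.

```python
# DFA: at most one next state per (state, character). State 2 is the match state.
DFA = {
    0: {"a": 1},
    1: {"a": 1, "b": 2},
}

# NFA: a character may lead to several next states at once.
NFA = {
    0: {"a": {0, 1}},  # on 'a', both staying in 0 and advancing to 1 are valid
    1: {"b": {2}},
}


def run_dfa(subject: str) -> bool:
    state = 0
    for ch in subject:
        # Any character without an explicit transition falls back to state 0.
        state = DFA[state].get(ch, 0)
        if state == 2:
            return True
    return False


def run_nfa(subject: str) -> bool:
    states = {0}
    for ch in subject:
        nxt = {0}  # state 0 loops on every character
        for s in states:
            nxt |= NFA.get(s, {}).get(ch, set())
        if 2 in nxt:
            return True
        states = nxt
    return False
```

Both functions accept the same subjects; only the bookkeeping differs.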

In general, then, DFAs are typically faster and more straightforward to execute, but involve a higher number of states, while NFAs typically can be implemented with many fewer states, which uses far less memory, but are more complex to execute, since multiple valid paths through the state machine must be assessed. And DFAs and NFAs have other advantages and disadvantages in comparison with each other, in addition to those mentioned here.

Further with respect to regex processing, the popularity of the PERL scripting language has had a significant effect on the art of processing regular expressions. The PERL syntax incorporates regex processing with a powerful and popular regex feature set. The PERL regex syntax is the most widely used ‘flavor’ today, and regex engines are often referred to as being ‘PERL compatible.’ There is a FOSS (Free Open Source Software) project called PCRE (Perl Compatible Regular Expressions), which is a library in the C programming language for compiling and processing regular expressions using the PERL syntax and feature set. This library is used in existing Intrusion Prevention System (IPS) products.

PCRE works at least in part by converting a regex to an NFA state machine. Again, an NFA state machine is referred to as being non-deterministic because, in some states, there may be more than one valid transition out to respective next states. Using an NFA gives a lot of flexibility in terms of language syntax, but processing an NFA can require a lot of attempts to match on failed branches of the state machine. The process of backing up into a prior state after a failed match attempt is called ‘backtracking.’ The conventions used in PCRE call for attempting to match as much text as possible, and backtracking if the match fails. For this reason, backtracking is common and performance suffers. Note that the type of search done using the PCRE engine is called a ‘depth-first’ search, because it tries the deepest paths through the NFA state machine first, and then backtracks to try other branches.

As also noted above, there is a different type of state machine, called a DFA state machine (or just DFA), which does not require backtracking. It is possible to convert an NFA to a DFA with some restrictions on syntax features (e.g., no backreferences). A DFA is more like a traditional state machine that is driven from state to state based on events (i.e. characters in the subject being searched). As noted above, DFAs are generally faster than NFAs, but usually have more states, and thus require more memory.

Recent versions of PCRE include an Application Program Interface (API) purporting to implement a DFA search, though PCRE does not generate a DFA in this case. Instead, these versions walk an NFA in a breadth-first manner. This process is, in essence, like generating a DFA on the fly, every time the search is performed. The results are the same, but the performance is very slow.

SUMMARY

Methods and systems are provided for using keyword preprocessing, Boyer-Moore analysis, and hybrids thereof, for processing regular expressions in intrusion-prevention systems. An exemplary method may be carried out in an intrusion-prevention system for examining network traffic and identifying therein the presence of signature data patterns. In accordance with the method, a state-transition table is provided, said table representative of a predetermined data pattern, and comprising a plurality of states, each state having a set of egress events, each egress event defining a transition from a current state to a next state.

The state-transition table may be representative of a state diagram that itself is representative of the predetermined data pattern. The predetermined data pattern may be representative of a regular expression. Furthermore, each egress event may be either a character class or a character string. The predetermined data pattern is parsed to identify a set of character strings therein. The identified set of character strings in the predetermined data pattern may consist of those character strings that (a) include at least two distinct characters and (b) have a string length that is greater than a threshold number.

Further in accordance with the exemplary method, a subject is received, where the subject is to be evaluated for the presence of the predetermined data pattern. The subject is preprocessed to find therein any instances of the character strings identified in the predetermined data pattern. Preprocessing the subject may involve using a keyword-tree search. A keyword table is then populated with a subset of the identified character strings, the subset consisting of those character strings found in the subject during preprocessing. The subject may comprise a payload of one or more packets, and the presence of the predetermined data pattern may be indicative of a potential security threat.

While using the state-transition table to evaluate the subject for the presence of the predetermined data pattern, a first state is transitioned into, where the first state has a first one of the identified character strings as a first egress event thereof. The first egress event defines a transition from the first state to a second state. Responsive to transitioning into the first state, the keyword table is checked for the first character string. Responsive to finding the first character string in the keyword table, the transition is taken from the first state to the second state. Transitioning from one state to another may involve recursively calling a state-search function.

Preprocessing the subject may involve identifying positions in the subject where the instances of the identified character strings are located, and the keyword table may be populated with the identified positions. Furthermore, a first-state range may be calculated, where the first-state range is a range of positions in the subject in which to search for the presence of at least one of the first state's egress events. As such, checking the keyword table for the first character string may involve checking the keyword table for an instance of the first character string at a position within the first-state range. Also, finding the first character string in the keyword table may involve finding in the keyword table an instance of the first character string at a position within the first-state range.
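The keyword table and the range-limited lookup can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, a simple repeated `str.find` scan stands in for the keyword-tree search mentioned above, and a keyword is treated as being “within” a range when its starting position falls inside that range (inclusive bounds).

```python
from collections import defaultdict


def build_keyword_table(subject: str, keywords: list[str]) -> dict[str, list[int]]:
    """Map each keyword found in the subject to the sorted list of its start positions.

    Keywords that do not occur in the subject are left out of the table.
    A plain scan is used here in place of a keyword-tree search.
    """
    table: dict[str, list[int]] = defaultdict(list)
    for kw in keywords:
        pos = subject.find(kw)
        while pos != -1:
            table[kw].append(pos)
            pos = subject.find(kw, pos + 1)
    return dict(table)


def keyword_in_range(table: dict[str, list[int]], kw: str, start: int, end: int):
    """Return the first recorded position of kw with start <= position <= end, else None."""
    for pos in table.get(kw, []):
        if start <= pos <= end:
            return pos
    return None


if __name__ == "__main__":
    subject = "GET /cmd.exe?x=cmd.exe"
    table = build_keyword_table(subject, ["cmd.exe", "passwd"])
    print(table)
    print(keyword_in_range(table, "cmd.exe", 6, 20))
```

Because the subject is scanned once up front, checking a string egress event during state evaluation reduces to a table lookup rather than a fresh search of the subject.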

A cursor may correspond to a location in the subject that is currently being evaluated. Furthermore, the transition into the first state may have been from a state referred to here as a previous state, according to an egress event referred to here as a previous-state egress event, and the previous state may have an associated previous-state range in the subject. Calculating the first-state range may involve setting a start of the first-state range equal to the cursor; and then, starting at the cursor, and extending no further than an end of the previous-state range, determining that the subject includes a number of consecutive instances of the previous-state egress event that end at a first position in the subject; and setting an end of the first-state range based on the first position. Furthermore, it may be determined that the first state does not have a character-class loop transition. And the previous-state range may be calculated.

Alternatively, calculating the first-state range may comprise determining that the first state has a character-class loop transition; setting a start of the first-state range equal to the cursor; and then, starting at the cursor, determining that the subject includes a number of consecutive characters that satisfy the character-class loop transition and that end at a first position in the subject; and setting an end of the first-state range based on the first position.
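The character-class loop case can be sketched as a simple forward scan. The text says only that the end of the range is set “based on” the position where the run of matching characters ends; this sketch assumes the end is the index just past the last matching character, and the function name and the use of a Python set to represent a character class are likewise assumptions.

```python
def range_with_cc_loop(subject: str, cursor: int, char_class: set[str]) -> tuple[int, int]:
    """Return (start, end) of the state range for a state with a character-class loop.

    start is the cursor; end is the index just past the run of consecutive
    characters, beginning at the cursor, that belong to char_class.
    """
    end = cursor
    while end < len(subject) and subject[end] in char_class:
        end += 1
    return cursor, end


if __name__ == "__main__":
    digits = set("0123456789")
    print(range_with_cc_loop("123abc", 0, digits))
```

An egress event of the state would then be sought only at positions inside the returned range, since the loop transition cannot carry the evaluation any further.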

In another aspect, an exemplary embodiment may take the form of a method that may also be carried out in an intrusion-prevention system for examining network traffic and identifying therein the presence of signature data patterns. In accordance with the method, a state-transition table is provided, where the table is representative of a predetermined data pattern, and includes states that each have a set of egress events that define transitions to next states. The state-transition table may be representative of a state diagram, where the state diagram is representative of the predetermined data pattern. The predetermined data pattern may be representative of a regular expression. Furthermore, each egress event may be either a character class or a character string.

Further in accordance with this embodiment, a subject is received for evaluation for the presence of the predetermined data pattern. The subject may comprise a payload of one or more packets, and the presence of the predetermined data pattern may be indicative of a potential security threat. While using the state-transition table to evaluate the subject, a first state is transitioned into, where the first state has a first character string as a first egress event thereof, defining a transition from the first state to a second state.

Responsive to transitioning into the first state, a Boyer-Moore search is performed for the first character string in the subject. This search may be performed responsive to making either or both of the following determinations: (a) that the first character string does not include at least two distinct characters and (b) that the first character string has a string length that is less than a threshold number.

In some embodiments, a first-state range may be calculated, where the first-state range is a range of positions in the subject in which to search for the presence of at least one of the first state's egress events. As such, performing the Boyer-Moore search for the first character string in the subject may comprise performing the Boyer-Moore search for the first character string in the first-state range.
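A range-limited search of this kind can be sketched as below. Note the substitution: this is the Horspool simplification of Boyer-Moore, which uses only a bad-character shift table, rather than the full Boyer-Moore algorithm with its good-suffix rule. The function name and the half-open `[start, end)` range convention are assumptions for illustration.

```python
def horspool_search(subject: str, pattern: str, start: int = 0, end=None) -> int:
    """Return the index of the first occurrence of pattern that lies entirely
    within subject[start:end], or -1 if there is none."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    end = len(subject) if end is None else min(end, len(subject))
    m = len(pattern)

    # Bad-character table: distance from the rightmost occurrence of each
    # character (excluding the final pattern character) to the pattern's end.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    pos = start
    while pos + m <= end:
        # Slice comparison is used for brevity instead of a right-to-left loop.
        if subject[pos:pos + m] == pattern:
            return pos
        # Skip ahead based on the subject character aligned with the pattern's end.
        pos += shift.get(subject[pos + m - 1], m)
    return -1


if __name__ == "__main__":
    print(horspool_search("xxaaaaabxx", "aaaaab"))
    print(horspool_search("abcabc", "abc", 1, 5))
```

Restricting the search to the state range keeps the cost proportional to the range rather than to the whole subject.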

A cursor may correspond to a location in the subject that is currently being evaluated. Furthermore, transitioning into the first state may involve transitioning from a previous state into the first state according to a previous-state egress event, where the previous state has an associated previous-state range in the subject. As such, calculating the first-state range may comprise setting a start of the first-state range equal to the cursor; starting at the cursor, and extending no further than an end of the previous-state range, determining that the subject includes a number of consecutive instances of the previous-state egress event, the consecutive instances ending at a first position in the subject; and setting an end of the first-state range based on the first position. It may also be determined that the first state does not have a character-class loop transition. And the previous-state range may be calculated.

In other embodiments, calculating the first-state range may comprise determining that the first state has a character-class loop transition; setting a start of the first-state range equal to the cursor; starting at the cursor, determining that the subject includes a number of consecutive characters that satisfy the character-class loop transition, the consecutive characters ending at a first position in the subject; and setting an end of the first-state range based on the first position.

Further in accordance with the methods, upon the Boyer-Moore search determining that an instance of the first character string is present in the subject, the transition from the first state to the second state is responsively taken. Transitioning from one state to another may involve recursively calling a state-search function.

In yet another aspect, another exemplary embodiment may take the form of a method carried out in an intrusion-prevention system for examining network traffic and identifying therein the presence of signature data patterns. In accordance with this method, a state-transition table representative of a predetermined data pattern is provided, where the state-transition table comprises a plurality of states, each state having a set of egress events, each egress event defining a transition from a current state to a next state. The state-transition table may be representative of a state diagram, which itself may be representative of the predetermined data pattern. The predetermined data pattern may be representative of a regular expression. And each egress event may be either a character class or a character string.

Further in accordance with this method, the predetermined data pattern is parsed to identify a set of a first type of character strings therein. The first type of character string may be defined by both (a) including at least two distinct characters and (b) having a string length greater than a threshold number.

Further in accordance with this method, a subject is received for evaluation for the presence of the predetermined data pattern. The subject may comprise a payload of one or more packets, and the presence of the predetermined data pattern may be indicative of a potential security threat. The subject is preprocessed, perhaps using a keyword-tree search, to find therein any instances of the identified character strings.

Further in accordance with this method, a keyword table is populated with a subset of the identified character strings, the subset consisting of those character strings found in the subject during preprocessing. Preprocessing the subject may involve identifying positions in the subject where the instances of the identified character strings are located, and populating the keyword table with the identified positions.

Further in accordance with this method, while using the state-transition table to evaluate the subject for the presence of the predetermined data pattern, a first state is transitioned into. The first state has a given character string as a first egress event thereof, where the first egress event defines a transition from the first state to a second state.

Further in accordance with this method, responsive to transitioning into the first state, the subject is searched for an instance of the given character string, and, responsive to determining that there is an instance of the given character string in the subject, the transition from the first state to the second state is taken.

When the given character string is of the first type, searching the subject for an instance of the given character string comprises checking the keyword table for the given character string, and determining that there is an instance of the given character string in the subject comprises finding the given character string in the keyword table.

When the given character string is of a second type different from the first type, searching the subject for an instance of the given character string comprises performing a Boyer-Moore search for the given character string in the subject, and determining that there is an instance of the given character string in the subject comprises the Boyer-Moore search determining that an instance of the given character string is present in the subject. The second type of character string may be defined by either or both of (a) not including at least two distinct characters and (b) having a string length less than or equal to the threshold number.
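The hybrid selection between the two search strategies can be sketched as a small dispatch function. Everything named here is hypothetical: the threshold value of 3 is an arbitrary choice for illustration, the keyword table is assumed to map each found string to a list of positions, and Python's built-in `str.find` stands in for the Boyer-Moore search.

```python
def is_first_type(string: str, threshold: int) -> bool:
    """First type: at least two distinct characters AND length greater than threshold."""
    return len(set(string)) >= 2 and len(string) > threshold


def find_egress_string(subject: str, string: str,
                       keyword_table: dict[str, list[int]],
                       threshold: int = 3) -> int:
    """Return a position of the egress string in the subject, or -1 if absent."""
    if is_first_type(string, threshold):
        # First type: consult the table populated during preprocessing.
        positions = keyword_table.get(string, [])
        return positions[0] if positions else -1
    # Second type: search the subject directly (stand-in for Boyer-Moore).
    return subject.find(string)


if __name__ == "__main__":
    subject = "aaaa--attack"
    table = {"attack": [6]}  # as if produced by preprocessing
    for s in ("attack", "aaaa", "--", "exploit"):
        print(s, find_egress_string(subject, s, table))
```

Short or single-character-repeat strings are cheap to search for directly and would bloat the preprocessing step, which is one plausible reading of why the two types are handled differently.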

In some embodiments, a first-state range may be calculated, where the first-state range is a range of positions in the subject in which to search for the presence of at least one of the first state's egress events. As such, checking the keyword table for the given character string may involve checking the keyword table for an instance of the given character string at a position within the first-state range. Moreover, finding the given character string in the keyword table may involve finding in the keyword table an instance of the given character string at a position within the first-state range. And performing the Boyer-Moore search for the given character string in the subject may involve performing the Boyer-Moore search for the given character string in the first-state range, while the Boyer-Moore search determining that an instance of the given character string is present in the subject may involve the Boyer-Moore search determining that an instance of the given character string is present in the first-state range.

A cursor may correspond to a location in the subject that is currently being evaluated. And transitioning into the first state may comprise transitioning from a previous state into the first state according to a previous-state egress event, where the previous state has an associated previous-state range in the subject. As such, calculating the first-state range may comprise setting a start of the first-state range equal to the cursor; starting at the cursor, and extending no further than an end of the previous-state range, determining that the subject includes a number of consecutive instances of the previous-state egress event, the consecutive instances ending at a first position in the subject; and setting an end of the first-state range based on the first position. It may be determined that the first state does not have a character-class loop transition. And the previous-state range may be calculated.

In other embodiments, calculating the first-state range may comprise determining that the first state has a character-class loop transition; setting a start of the first-state range equal to the cursor; starting at the cursor, determining that the subject includes a number of consecutive characters that satisfy the character-class loop transition, the consecutive characters ending at a first position in the subject; and setting an end of the first-state range based on the first position. In this embodiment as in all embodiments, transitioning from one state to another state may involve recursively calling a state-search function.

In another aspect, an exemplary embodiment may take the form of an intrusion-prevention network device for examining network traffic and identifying therein the presence of signature data patterns. The network device comprises a network interface, a processor, and data storage. The data storage comprises a state-transition table representative of a predetermined data pattern, the state-transition table comprising a plurality of states, each state having a set of egress events, each egress event defining a transition from a current state to a next state.

The data storage further comprises instructions executable by the processor to parse the predetermined data pattern to identify a set of character strings therein; receive a subject to be evaluated for the presence of the predetermined data pattern, and preprocess the subject to find therein any instances of the identified character strings; populate a keyword table with a subset of the identified character strings, the subset consisting of those character strings found in the subject during preprocessing; while using the state-transition table to evaluate the subject for the presence of the predetermined data pattern, transition into a first state having a first one of the identified character strings as a first egress event thereof, the first egress event defining a transition from the first state to a second state; and responsive to transitioning into the first state, check the keyword table for the first character string, and, responsive to finding the first character string in the keyword table, transition from the first state to the second state.

Note as well that some or all of the variations described above with respect to the method embodiments may also apply to the network-device embodiments, in any suitable combinations and permutations.

These as well as other aspects and advantages will become apparent to those of ordinary skill in the art by reading the following detailed description, with reference where appropriate to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram of a parsed regex.

FIGS. 2A-2F are state flow diagrams for terms in a regex.

FIGS. 3A-3C are state flow diagrams for states having lambda count transitions.

FIG. 4 is a state diagram showing an NFA to DFA conversion.

FIGS. 5A-5C are diagrams depicting the use of checked ranges.

FIGS. 6A-6B are state diagrams exemplifying aspects of the state transitions.

FIGS. 7A-7C are diagrams depicting the incremental operation of the state analysis.

FIGS. 8A-8C show the various components of a rules base for analyzing data subjects.

FIG. 9 is a flow chart depicting an example of a method.

FIG. 10 is a flow chart depicting an example of a method.

FIG. 11 is a flow chart depicting an example of a method.

DETAILED DESCRIPTION OF THE DRAWINGS

1. Introduction

Described herein are aspects of an improved regular expression engine preferably for use in data-network security products including intrusion-prevention and intrusion-detection systems, as regular expressions may be used to classify network traffic as malicious or benign. The improved regex engine provides a fast and flexible method of analyzing data.

Various embodiments may use one or more of the following features: (i) the analysis of states and transitions between them using dynamically determined ranges (state ranges) within the subject being processed; (ii) state transitions that are triggered by a count associated with transitioning into a given state (referred to herein as lambda transitions); (iii) state transitions that are triggered by strings, and the identification of strings in (a) a pre-processing step, (b) an efficient string-identification algorithm, or (c) a hybrid of (a) and (b); and (iv) a method of suspending and restarting regex processing as new data on a flow arrives, without excessively caching already-received data.

The method of analyzing data described herein is state-machine driven as opposed to subject driven. When in a particular state, the system examines the possible egress transitions (i.e. transitions from the particular state to some other state, as opposed to a loop transition, which begins and ends at the same state) out of the state, and looks for events in the subject corresponding to the egress transitions. Note that the “subject” refers to the data stream in which the regular-expression engine is looking for a match to a given regular expression. This is a different approach than feeding events (characters) as input to a state machine, driving transitions in that state machine.

The present methods and systems are not a pure DFA implementation, although they incorporate some DFA concepts. In a DFA, each input event (character) results in a single, unambiguous transition, and there is only one valid state at a time. The present methods and systems redefine the input alphabet (i.e. the types of terms in the subject that trigger transitions from current states to next states) for each regex to be string matches or character-class matches. Note that these events can overlap, if a character in the subject is part of a string match, but also matches a character-class transition. In cases like this, there are multiple paths that must be checked. The present methods and systems attempt to optimize the checking of these paths.

To produce a DFA state machine (representing a particular regex) that utilizes the improvements described herein, an NFA may be converted to a DFA. The NFA is preferably obtained by first (a) parsing the particular regex into a tree of terms and then (b) using a known algorithm, known as Thompson's algorithm, for constructing an NFA from the parsed expression. The terms (i.e. events that drive state-machine transitions) defined herein are character-class (CC) terms and string (i.e. character string) terms, as well as grouped sub-expressions combining strings and CCs. These terms may be stored in a CC Table and Keyword Tree, respectively.

In the example shown in FIG. 1, the regex is ^([+-]?[0-9]+(\.[0-9]*)?)[CF]$. This regex is shown in FIG. 1 as having been parsed into a tree 100, where every leaf (i.e. a node without any nodes depending therefrom) on the tree represents a term in the regex. The terms of the regex are connected by concatenation (“C”) nodes, each representing a logical AND. The tree 100 in FIG. 1 is a way of depicting the various levels of parenthetically-nested terms in the regex of FIG. 1. Furthermore, it can be seen that this regex begins with a “^”, which is known as a start anchor. This means that any matching subject would have to match the portion of the regex that follows the “^” from the beginning of the subject.

The next term in the regex is “[+-]”, modified by the “?” specifier. This indicates that the regex is looking for either a “+” or a “-” character, or neither, as the “?” specifier indicates that one or none of the preceding term is considered a match. Next is the character class “[0-9]”, indicating “any digit 0-9”, modified by the “+”, indicating “one or more of the preceding.” Taken together, this class and its specifier match one or more digits.

Next comes a parenthetical that is modified by the “?”, indicating that one or none of what is in the parenthetical would match. Inside the parenthetical, the first term is “\.”, where the backslash indicates an escape character, which means that a “.” is actually sought in the subject, rather than the “.” meaning “any character”, which it often indicates in regex processing. Next comes the digit character class again, modified by the “*”, which means “zero or more of the preceding.”

Next comes the character class “[CF]”, indicating that either a “C” or an “F” (but not both) would be acceptable. Finally comes the “$”, which is known as an end anchor, indicating that, to match this regex, a particular subject would have to end following the “C” or “F”. Thus, this regex is looking for subjects such as “3.2F”, “+346.78C”, “-987.326F”, “45F”, etc. Essentially, the regex is looking for what may be expressions of Celsius or Fahrenheit temperatures, with or without a decimal point and zero or more digits thereafter.
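The walkthrough above can be checked against Python's standard `re` module. The helper name `is_temperature` and the non-matching samples are assumptions added for this sketch, and the sign class is written with an ASCII hyphen-minus.

```python
import re

# Start anchor, optional sign, one or more digits, optional decimal part
# (a literal '.' followed by zero or more digits), then 'C' or 'F', end anchor.
TEMPERATURE = re.compile(r"^([+-]?[0-9]+(\.[0-9]*)?)[CF]$")


def is_temperature(subject: str) -> bool:
    """Return True if the whole subject looks like a Celsius/Fahrenheit reading."""
    return TEMPERATURE.search(subject) is not None


if __name__ == "__main__":
    for subject in ("3.2F", "+346.78C", "-987.326F", "45F", "3.F", "F", "3.2", "x45F"):
        print(subject, is_temperature(subject))
```

The outer parenthetical captures the numeric part, so `TEMPERATURE.search("+346.78C").group(1)` yields the string `+346.78`. A subject such as “3.F” also matches, because the digits after the decimal point are optional.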

FIGS. 2A-2F depict various terms (or combinations thereof) that may be present in regular expressions, where each term is paired with a graphical representation of a state machine, or perhaps a subpart of a larger state machine, that implements the associated term or combination of terms. In FIGS. 2A-2F, “A” and “B” each represent a term in a regular expression. Note that those terms may be similar to the above examples, or may take other forms. Three basic non-repeat forms are A, AB, and A|B, as shown in FIGS. 2A-2C, respectively. Thus, FIG. 2A depicts just “A” and the associated pair of states labeled “0” and “1”. As can be appreciated from FIG. 2A, the presence of the term A in the data subject being analyzed results in the transition being taken from state 0 to state 1.

FIG. 2B depicts a portion of a state machine with an additional state “2”, and generally shows how a state machine would look for the terms A and B in series to result in transitioning from state 0 to state 1, and from state 1 to state 2. FIG. 2C shows an alternative structure, where a transition from state 0 to state 1 can be taken if either A or B is present in the subject, and, more specifically, present at the point in the subject that is currently being evaluated when the processing through the state machine is in state 0.

FIGS. 2D-2F depict what are known as repeat specifiers, which generally refers to the above-explained ?, +, and * that come after the term A in FIGS. 2D-2F, respectively. The ? in FIG. 2D indicates that the preceding term A need be present either zero or one times at the currently-evaluated position in the subject, to make the transition from state 0 to state 1. This is shown by the two alternative transitions available in FIG. 2D, where the upper transition is driven by the presence in the subject of the term “A,” while the lower transition is what is known as an “epsilon” transition, which corresponds to the transition being valid regardless of what is present at the currently-evaluated portion of the subject. Essentially, FIG. 2D is conveying that a transition can be made on an A or on nothing (an empty string).

This illustrates some general terminology related to the present methods and systems. Generally, the data being analyzed by the state machine is referred to as the “subject,” or “data subject,” and may, in the context of assessing network traffic in an intrusion-prevention system, include a payload of one or more packets, such as Internet Protocol (IP) packets. Furthermore, a value known as the “cursor” is maintained, which corresponds to a particular position in that subject (typically indexed starting with zero) that is currently being evaluated by the state machine. Note that, in typical state-machine implementations, a transition from one state to another corresponds to “consuming” a single character in the subject, and, accordingly, advancing the cursor to the next position to be analyzed while in the next state. Epsilon transitions, however, do not advance the cursor.

Continuing the previous examples, FIG. 2E depicts the state-machine implementation of the + quantifier, which generally indicates in FIG. 2E that the state machine will be looking for one or more consecutive occurrences of the term A, which, again, could be a string, or a particular character class (such as digits 0-9 or letters a-z). This can be seen in the state-machine implementation where, with epsilon transitions, a single A will result in arriving at state 3, as will any number of consecutive A terms. FIG. 2E is clearly an NFA, then, as, for example, valid epsilon transitions are present from state 2 to both state 3 and to state 1.

Finally, FIG. 2F depicts the * quantifier, which generally indicates that the state machine will be looking for zero or more consecutive instances of the A term. This is shown in FIG. 2F as merely modifying FIG. 2E by adding an epsilon transition from state 0 to state 3. Note that, in each of the examples shown in FIGS. 2A-2F, the state machine is considered to have found in the subject a match for the associated term (or terms) if the processing of the subject results in the state machines (or state-machine sections) proceeding from the left-most state to the right-most state. In that sense, the right-most state in each of FIGS. 2A-2F may be referred to as a “match state” (also known as an “accepting state”) for the associated term or terms (and associated quantifier if applicable).

2. Lambda Transitions

In addition to the operators and specifiers described above, some regexes include specifiers known as “count specifiers,” which indicate how many of the preceding term are being checked for at the currently-evaluated portion of the subject. For example, if a regex included A{5}, this would indicate that five consecutive occurrences of the “A” term were being looked for in the subject. And again, A could be a character string, such as “boy”, or a character class, such as [0-9], which corresponds to “any digit 0-9”. More generally, A{x} would indicate x consecutive instances of A. In a traditional state-machine implementation, this would take the form of a series of six states, with an “A” transition between each pair of states in the series.

Another type of count specifier that appears in regexes takes the form A{x,y}, which generally correlates to looking for at least x, but no more than y, consecutive instances of A in the currently-evaluated part of the subject. In a traditional state-machine implementation, this would take the form of a series of approximately x+1 states in series, with “A” transitions in between, that require x consecutive A terms, followed by a series of states having “A” transitions to the next state, but also having epsilon transitions to a match state, to look for (y−x) more A terms. Thus, processing would be complex, with numerous valid alternative paths.

The present methods and systems simplify state-machine implementation and processing using a type of transition referred to herein as a “lambda transition.” Every state preferably has a count parameter (i.e. state count) that is initialized to zero, and then is incremented each time the state is entered (i.e. transitioned into) and decremented when the state is ‘backed out’ of after an attempted match. This is to facilitate the two types of lambda transitions described herein, referred to as “Lambda1” and “Lambda2” transitions. That is, instead of taking a repeat count such as A{5} and converting it to a series of six states as described above, two new functions Lambda1 and Lambda2 are provided as follows:

-   Lambda1(x): transition if the count of the current state is equal to x.
-   Lambda2(x): transition if the count of the current state is less than or equal to x.

FIG. 3A depicts the simplified state-machine implementation of A{x} when the Lambda1 function is incorporated along with state counts. Thus, state 2 maintains its state count, and increments that state count each time state 2 is transitioned into. Once x consecutive occurrences of A have been identified in the subject, state 2's state count will be equal to x, and state 2 will thus execute the Lambda1 transition to state 3. It can be seen that the cooperation of state counts and the Lambda1 transition obviates the need for an extended series of states to look for a certain number of consecutive occurrences of a particular term.
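The two lambda functions and the counting behavior of FIG. 3A can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation described herein: the names `lambda1`, `lambda2`, and `match_exact_repeat` are hypothetical, the term A is assumed to be a literal string, x is assumed to be at least 1, and the decrement on backing out of a state is omitted.

```python
def lambda1(state_count: int, x: int) -> bool:
    """Lambda1(x): transition if the count of the current state equals x."""
    return state_count == x


def lambda2(state_count: int, x: int) -> bool:
    """Lambda2(x): transition if the count of the current state is at most x."""
    return state_count <= x


def match_exact_repeat(subject: str, cursor: int, term: str, x: int):
    """Sketch of A{x} using one counting state and a Lambda1 egress.

    Returns the cursor just past the x-th occurrence of `term`,
    or None if fewer than x consecutive occurrences are present.
    """
    state_count = 0  # initialized to zero, incremented on each entry
    while subject.startswith(term, cursor):
        cursor += len(term)
        state_count += 1
        if lambda1(state_count, x):
            return cursor  # take the Lambda1 transition
    return None
```

For example, `match_exact_repeat("boyboyboyX", 0, "boy", 3)` finds three consecutive occurrences with a single counting loop, rather than a chain of four states.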

FIG. 3B depicts the simplified state-machine implementation of A{0,y} when the Lambda2 function is incorporated along with state counts. According to the A{0,y} subexpression, at least zero and no more than y consecutive occurrences of A are being sought in the currently-evaluated portion of the subject. FIG. 3B is similar to FIG. 3A; however, FIG. 3B does include an epsilon transition from state 0 to state 3, since zero consecutive occurrences of A is considered a match for this subexpression.

Another difference is that the transition from state 2 to state 3 is a Lambda2 transition instead of a Lambda1 transition, and will thus be taken if state 2 is entered, its state count is incremented, and that incremented state count is less than or equal to y, which satisfies the condition of “at least zero and no more than y consecutive occurrences of A.” Thus, the incorporation of the Lambda2 function and state counts also reduces the number of necessary states, since a traditional state-machine implementation would have a series of approximately y+1 states with “A” transitions in between, and each with an epsilon transition to a match state.

Other patterns can be constructed. For example, FIG. 3C depicts the state machine implementation of A{x,y}, which is equivalent to A{x}A{0,(y−x)}. Once A{x,y} is rewritten as A{x}A{0,(y−x)}, it can be appreciated that it can be implemented as two smaller state machines joined by an intermediate state 3. The first of these smaller state machines matches that shown in FIG. 3A, and looks for x consecutive instances of A. The second is similar to FIG. 3B, with the exception of the Lambda2 transition having a threshold value of (y−x) instead of y.
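The equivalence of A{x,y} and A{x}A{0,(y−x)} can be spot-checked with an off-the-shelf regex library. The sketch below uses Python's `re` module purely as a stand-in for checking the rewrite; the helper name `split_count` and the choice of the term “a” with x=2 and y=4 are assumptions made for illustration.

```python
import re


def split_count(term: str, x: int, y: int) -> str:
    """Rewrite term{x,y} as term{x} followed by term{0,y-x}."""
    return f"(?:{term}){{{x}}}(?:{term}){{0,{y - x}}}"


original = re.compile(r"(?:a){2,4}")
rewritten = re.compile(split_count("a", 2, 4))

# Compare both forms on runs of 0 through 7 consecutive "a" characters.
agree = all(
    bool(original.fullmatch("a" * n)) == bool(rewritten.fullmatch("a" * n))
    for n in range(8)
)
```

Both patterns accept runs of two, three, or four “a” characters and reject all other lengths in this range.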

3. NFA to DFA Conversion

As referenced above, an algorithm known as Thompson's algorithm can be used to build an NFA from a parsed regex, and that NFA can then be converted to a DFA. To carry out this algorithm, an initial step is to define a start state and an end state, and pass those to a function along with a root node of the tree. The function defines intermediate nodes as required, and recursively inserts terms to produce a state graph 400 like the one shown in FIG. 4.

The NFA 400 is then converted to a DFA 402. In general, a given state in the DFA corresponds to a set of states from the NFA. The NFA state set defines the DFA state and which transitions are valid. To determine these state sets, a recursive process is carried out, starting with the initial state in the DFA (D0) being set to include the initial NFA state (S0) as a seed. After that, the following steps are performed on S0 and, in turn, on the other states in the NFA 400, to arrive at the mapping shown in FIG. 4:

-   (i) Expand the seed state to the epsilon closure, thus defining a state set in the NFA. In other words, for every NFA state in the state set, follow the epsilon transitions and include those states in the state set as well. Continue this until all epsilon-reachable states are included;
-   (ii) Follow non-epsilon transitions to define next states for the currently-evaluated DFA state. For unique non-epsilon transitions from the NFA states, there will be a next state. A unique transition is defined by the event that triggers the transition (e.g., a string match for ‘boy’). Note that, in an NFA state set, there may be multiple transitions to various states based on the same event. All of these are grouped into one transition to a DFA state that is seeded with the target NFA states; and
-   (iii) Recursively repeat the process from step (i), starting with the new DFA state containing the target NFA state set.

After expanding a DFA's states to the epsilon closure, there is a check to make sure that there is not already a DFA state that includes exactly that NFA state set. This prevents generation of duplicative DFA states.
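Steps (i) through (iii), together with the duplicate-state check, can be sketched as follows. This is an illustrative sketch only: it uses a work queue rather than recursion, the NFA representation (a dictionary of `(event, target)` pairs with `None` standing for epsilon) is an assumption, and the small example NFA for “a” followed by zero or more “b” is hypothetical rather than the NFA 400 of FIG. 4.

```python
from collections import deque

EPSILON = None  # assumed marker for epsilon transitions


def epsilon_closure(nfa, seeds):
    """Step (i): expand seed states to every epsilon-reachable NFA state."""
    closure = set(seeds)
    stack = list(seeds)
    while stack:
        state = stack.pop()
        for event, target in nfa.get(state, []):
            if event is EPSILON and target not in closure:
                closure.add(target)
                stack.append(target)
    return frozenset(closure)


def nfa_to_dfa(nfa, start, end):
    """Build DFA states as sets of NFA states; returns (states, transitions, accepting)."""
    start_set = epsilon_closure(nfa, [start])
    dfa_states = {start_set: 0}  # NFA state set -> DFA state number
    transitions = {}             # (DFA state, event) -> DFA state
    work = deque([start_set])
    while work:
        current = work.popleft()
        # Step (ii): group non-epsilon transitions by triggering event.
        grouped = {}
        for state in current:
            for event, target in nfa.get(state, []):
                if event is not EPSILON:
                    grouped.setdefault(event, set()).add(target)
        for event in sorted(grouped):
            target_set = epsilon_closure(nfa, grouped[event])
            # Reuse an existing DFA state with exactly this NFA state set.
            if target_set not in dfa_states:
                dfa_states[target_set] = len(dfa_states)
                work.append(target_set)  # step (iii): process the new state
            transitions[(dfa_states[current], event)] = dfa_states[target_set]
    # A DFA state is accepting if it contains the NFA's end state.
    accepting = {num for nfa_set, num in dfa_states.items() if end in nfa_set}
    return dfa_states, transitions, accepting


# Hypothetical NFA for "a" followed by zero or more "b"; S4 is the end state.
example_nfa = {
    "S0": [("a", "S1")],
    "S1": [(EPSILON, "S2")],
    "S2": [("b", "S3"), (EPSILON, "S4")],
    "S3": [(EPSILON, "S2")],
}
states, table, accepting = nfa_to_dfa(example_nfa, "S0", "S4")
```

In this small example, three DFA states result, and the “b” transition out of the third state loops back to itself because its target set already exists.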

Thus, in the example of FIG. 4, one may start with D0:{S0}. Because there are no epsilon transitions from S0, D0 is complete. Next the transition from S0 to S4 is analyzed. Including the epsilon closure of S4 in a new DFA state, D1 is obtained:

-   D0:{S0}
-   D1:{S4,S6,S7}

The transitions from the D1 state set are from S4 to S6 and from S7 to S8. Each of these results in new DFA states. Continuing this process gives the complete set of DFA states:

-   D0:{S0}
-   D1:{S4,S6,S7}
-   D2:{S6,S7}
-   D3:{S3,S5,S7,S8}
-   D4:{S2}
-   D5:{S1}
-   D6:{S3,S9,S10}
-   D7:{S3,S10,S11}

Connecting these states with the appropriate transitions gives DFA 402 of FIG. 4. Note that DFA 402 utilizes the concept of character classes (CC). In accordance with the present methods and systems, CC matches are determined as needed during processing of the regex, and, to facilitate more rapid processing, each CC term in the regex is converted to a bit-mask array in a stored rule base. In particular, if there are 128 possible characters, then each character of a particular character class is mapped to a bit within a 128-bit mask word. Each character class may therefore be represented by a 128-bit mask-word where the 1 bits indicate that the corresponding character is a member of the class. To determine whether a given character of the subject being analyzed is a member of the class, it is simply converted to a corresponding bit position, and that position is interrogated to check whether it is a 1.
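The bit-mask representation of a character class can be sketched as follows. This is a minimal sketch, assuming a 128-character alphabet indexed by character code and using a Python integer as the 128-bit mask word; the names `make_mask` and `in_class` are hypothetical.

```python
def make_mask(members: str) -> int:
    """Build a 128-bit mask word with a 1 bit for each member character."""
    mask = 0
    for ch in members:
        code = ord(ch)
        if code >= 128:
            raise ValueError("only character codes 0-127 fit in a 128-bit mask")
        mask |= 1 << code
    return mask


def in_class(mask: int, ch: str) -> bool:
    """Convert the character to a bit position and test whether that bit is 1."""
    code = ord(ch)
    return code < 128 and ((mask >> code) & 1) == 1


DIGITS = make_mask("0123456789")  # the [0-9] character class
C_OR_F = make_mask("CF")          # the [CF] character class
```

Membership testing then costs one shift and one mask operation per subject character, regardless of how many characters the class contains.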

The DFA of FIG. 4 also includes what are referred to herein as character-class-loop (CC-loop) states. A CC-loop state is a state that has a character-class loop transition, which is a transition to itself, where that transition is associated with a CC event. For example, states D3 and D7 in the DFA 402 of FIG. 4 are CC-loop states. Using D3 as an example, it can be seen that there is a character-class loop transition out of and back into D3, where the subject event driving that transition is any digit 0-9.

Note that DFA 402 also includes anchors. A regex may be considered an anchored regex if it includes a beginning anchor (^), which specifies that matching the ensuing terms in the regex must start at the beginning of the subject. The start anchor is thus depicted as the transition from state D0 to state D1. A regex may also be considered an anchored regex if it includes an end anchor ($), which specifies that the subject must end (e.g. with an end-of-data indication, end-of-file indication, etc.) once the regex term preceding the end anchor has been identified in the subject. In DFA 402, an end anchor is the transition between D4 and D5.

In the example DFA 402 of FIG. 4, state D5 is marked with an asterisk, indicating that D5 is what is known as an accepting state. In general, an accepting state in a DFA that was generated from an NFA is any state that includes the NFA's end state (here, state S1 of NFA 400) as one of its component states. If an accepting state is reached during regex processing, the subject that is currently being processed is considered to have matched the regex that was used to generate the NFA in the first place, in accordance with the principles described above. In the context of intrusion prevention, this may result in the subject (or the packet(s) that included the subject) being quarantined for further processing. After the DFA is generated from the NFA, all of the DFA states are checked to see if they include the NFA's end state. Those that do are marked as accepting states. In FIG. 4, state D5 is the only state that includes NFA state S1, which is the end state in the NFA graph. D5 is therefore marked as an accepting state.

In accordance with the present methods and systems, after generation of the NFA from the regex, and further after generation of the DFA from the NFA, the DFA is preferably checked for certain cases involving count specifiers (i.e., lambda transitions) that may not be handled properly. If these patterns are found, an alternative prior art method such as PCRE may be used to handle the particular regex. Some specific examples include (i) repeat subexpressions with optional last terms (if the last term of the subexpression is optional, and the subexpression as a whole is modified by a repeat specifier, the algorithm for generating the DFA graph results in the Lambda transition splitting between two DFA states); and (ii) mixed loop-count states (string and CC-loop events, when there is a loop transition associated with a character-string event, as well as a character-class loop transition in the same counting state). In addition, certain features are not supported, including backreferencing (referring to text that was matched by a previous portion of the regex by a label at a later point in the regex) and atomic captures (referring to an ability to specify that certain subexpressions in a regex must be matched indivisibly, using a search policy known in regex processing as greedy matching).

4. State Ranges

In one aspect, the preferred system and methods avoid re-evaluating parts of the subject (the network traffic or byte stream that is being checked against one or more regexes) as much as possible. To do this, the system utilizes a concept referred to herein as a “checked range”, also referred to herein as a “state range”. The checked range is used to determine how far to advance in the subject when checking for transitions from a current state to a next state.

Note that, in general, a transition that connects a current state to a next state in the state diagram is also referred to herein as an “egress event” for the current state (and an “ingress event” for the next state). The checked range is denoted as a pair of positions delineating the start and end positions of a substring of the subject. The range start and end positions are inclusive; for example, if a checked range were [5,8], this would include characters at positions 5, 6, 7, and 8 in the subject.

As shown in FIG. 5A, traditional state-machine implementation and processing might involve numerous transitions 502 from state D0 to state D1, and failure back to state D0, before a valid transition 504 to state D2 is identified at a later position in the subject. In general, in FIGS. 5A, 5B, and 5C, the arrows represent transitions between states (while failures back to states are not shown), the horizontal lines (labeled by state name along the left side) represent different states in the DFA, and the horizontal position on those horizontal lines represents relative position in the subject being analyzed. Thus, the left end of the horizontal lines would be more towards the beginning of the subject, while the right end of the horizontal lines would be more towards the end of the subject.

FIGS. 5B and 5C also include rectangles placed along several of the horizontal lines. These rectangles represent checked ranges that have been calculated for their respective states as further described below. Thus, FIGS. 5B and 5C demonstrate the benefit of the checked-range concept of the present methods and systems, though not the calculation of those checked ranges. In general, upon transitioning into a particular state, a checked range for that state is calculated, where that checked range corresponds to a range of positions in the subject in which the present methods and systems will search for a valid transition to a next state (i.e. for an egress event).

With reference to FIG. 5B, state D0 is shown as having a checked range 512, while state D1 is shown as having a checked range 508. Thus, processing is initially in state D0, where checked range 512 is calculated for state D0. D0 may have one or more transitions to other states in the state diagram. Note that the states referenced in FIGS. 5A-5C are not necessarily those from FIG. 4; rather, they generally represent states in an arbitrary state diagram that is generated and evaluated in accordance with the present methods and systems.

Returning to FIG. 5B, then, the subject is checked within checked range 512 for a valid egress event to D1. Such an egress event is found and taken at 506. Thus, processing then moves to state D1 as the current state, at which point D1's checked range 508 is determined. The subject is then checked within D1's checked range 508 for a valid egress event out of D1, and such an event is found and taken to state D2 at 510. It can thus be appreciated in FIG. 5B that, by using checked ranges, D2 is reached in just two transitions rather than needing the many attempts and failures involving many transitions as shown in the traditional state-machine implementation of FIG. 5A.

In general, according to the present methods and systems, the state machine intelligently looks ahead in the subject for valid egress events, rather than naively trying a single character at a time, taking the transitions, failing back to the previous state, and repeating this process until, for example, the transition 504 from D1 to D2 is found. As described more fully below, while D1 is the current state, the present system is evaluating all of the positions in the subject at which the transition from D0 to D1 could happen, to try to find one of those positions that would result in a valid transition from D1 to D2. The present system, then, shortcuts and precludes the numerous attempts and failures by seeing that they would happen, and advancing through the subject to find a point in the subject at which such attempts and failures would end, and a valid transition (egress event), for example from D1 to D2, can be located. And this is done repeatedly in a recursive fashion, accomplishing faster and more efficient subject processing than is possible in traditional state-machine implementations.

In preferred embodiments, as described above, the subject is evaluated within the checked range for a given state for the presence of further transitions (egress events to next states) whenever a state is entered (i.e. transitioned into). In some cases, the checked range for a given state can be thought of as the range of positions in the subject for which a transition into the given state is possible. In general, a checked range is referred to as such because it is the range of positions in the subject that will be checked until the presence of a valid egress event is found in the subject, or until all of the current state's transitions to next states have been evaluated in the checked range.

A checked range (i.e. state range) can be as long as the entire subject, or as short as a single character, like [k,k], where ‘k’ represents a given index (i.e. position) in the subject. In general, during processing, a value known as the “cursor” is maintained, which corresponds to a position in the subject that is currently being evaluated, and thus is dynamically updated during processing. Typically, a given state's checked range will begin at the position where the cursor is when the given state is transitioned into, and will be calculated to be something between [cursor,cursor] and [cursor,end-of-subject], inclusive.

As referenced herein, it sometimes occurs in processing a subject using a given state machine that no valid egress events can be found for a given state within that state's checked range; in that situation, processing returns to the state from which the given state was transitioned into, often referred to herein as the “previous state”. As shown in FIG. 5C, in accordance with the present methods and systems, when a state fails back to a previous state, such as from D1 to D0 at 520 and 522, the checked ranges 524, 526 of D1 are used to advance the cursor when in D0 to the end of the most-recently-calculated checked range of D1, as shown by arrows 528, 530. This is because any of the events in the subject in that range that would have caused an egress transition from D0 into D1 have already been checked for D1 egress transitions. Thus, reprocessing of the same portion of the subject is reduced.

This type of processing is illustrated in the following pseudocode, which pertains to a current state having an egress event to a next state (nextState), where that egress event is associated with matching a particular character class in the subject. Note that, when the transition to nextState is taken, this is accomplished by calling a recursive “search-state” function. Thus, the illustrated pseudocode would also, in this embodiment, be inside that same “search-state” function, which operates to take transitions in the state machine by calling itself. It can be appreciated from this pseudocode that, if processing backs out of and returns from this call to search-state without arriving at an accepting state (i.e. match state) (at which point processing would stop and the subject would be considered to match the regex that was used to generate the state machine), then the cursor is advanced to the position just past the next state's checked range (i.e. checkedEnd+1).

while (the cursor is not yet at the end of the subject)
{
    if (subject matches the character class at the cursor)
    {
        // Take the transition to the next state
        search-state(transition→nextState, cursor);
        // Skip ahead in the subject so as to avoid needlessly
        // rechecking nextState before the end of its checked range
        cursor = transition→nextState→checkedEnd + 1;
    }
    else increment the cursor;
}
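The skip-ahead loop in the pseudocode can be sketched in runnable form as follows. This is only an illustration under stated assumptions: `search_next` is a hypothetical stand-in for the recursive search-state call and returns the next state's result and checked end, and the toy next state (entered on a digit, then looking for a keyword string whose start falls within its checked range) is invented for the example.

```python
def scan_with_skip(subject, cursor, in_class, search_next):
    """Scan for character-class matches, skipping past each failed checked range.

    search_next(cursor) returns (matched, checked_end) for the next state.
    Returns (matched, number_of_transitions_attempted).
    """
    attempts = 0
    while cursor < len(subject):
        if in_class(subject[cursor]):
            attempts += 1
            matched, checked_end = search_next(cursor)
            if matched:
                return True, attempts
            # Positions up to checked_end were already examined by the next state.
            cursor = checked_end + 1
        else:
            cursor += 1
    return False, attempts


def is_digit(ch):
    return ch in "0123456789"


def make_next_state(subject, keyword):
    """Toy next state: entered on a digit, then looks for `keyword`."""
    def search_next(cursor):
        start = cursor + 1  # the digit at `cursor` was consumed on entry
        checked_end = start
        while checked_end < len(subject) and is_digit(subject[checked_end]):
            checked_end += 1
        # Accept a keyword whose first character lies in [start, checked_end].
        found = subject.find(keyword, start, checked_end + len(keyword))
        return found != -1, checked_end
    return search_next


subject = "12ab34555"
outcome = scan_with_skip(subject, 0, is_digit, make_next_state(subject, "555"))
```

In this toy run, the digit at position 1 is never tried as a separate transition, because the failed attempt from position 0 already covered it within the next state's checked range.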

For states that have a CC-loop event, an intermediate range, referred to herein as a “term range”, is first calculated, prior to arriving at an answer for those states' checked ranges. In general, this term range for CC-loop states marks the range in the subject, starting at the cursor, where the CC-loop event matches. This type of state then calculates its own checked range based on the term range, essentially determining in what subset (which could be all) of the term range a valid egress event could occur, and this calculation may differ depending on the type of egress events the given state has, as described more fully below. The algorithm for computing the term range in the case of a CC-loop state is essentially:

// CC is character class for loop transition
checkedEnd = cursor;
while (CC matches at checkedEnd) ++checkedEnd;

Note that a lambda threshold of a lambda transition may be taken into account when determining checked ranges. For example, a checked range may end a number of characters short of a term range, where that number of characters is based on a lambda-transition threshold.

Consider the state machine of FIG. 4 operating on the subject: <<subject=−270.1C>>. The state machine will enter state D3 after the ‘2’ character, which is at position 1 in the subject (indexed from zero). Thus, upon entry into state D3, the cursor will be set to position 2, in other words pointing to the “7” in the subject. D3 will then compute its term range to be [2,4], which are the positions, starting with the cursor, at which D3's CC-loop transition (looking for any digit 0-9) matches, along with one additional character. As explained more fully below, since each of D3's egress events (i.e. the transitions to D4 and D6) has a length of only one character in the subject, D3 will compute its checked range to also be [2,4]. These are the positions in the subject that will be checked for the events that can move to another state.
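The term-range computation for this example can be reproduced with a short sketch. The function name `term_range` is hypothetical, an end-of-subject bound is added as an assumption (the pseudocode leaves it implicit), and an ASCII hyphen stands in for the minus sign in the subject.

```python
def term_range(subject, cursor, in_class):
    """Return the inclusive term range [cursor, checked_end] for a CC-loop state.

    checked_end stops on the first position where the class no longer
    matches, so the range includes one character past the matching run.
    """
    checked_end = cursor
    while checked_end < len(subject) and in_class(subject[checked_end]):
        checked_end += 1
    return cursor, checked_end


def is_digit(ch):
    return ch in "0123456789"


# Subject from the example, entered into D3 with the cursor at position 2.
d3_range = term_range("-270.1C", 2, is_digit)
```

Here the loop advances over “7” and “0” and stops on the “.”, giving the range [2,4] described above.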

As shown in the previous example, in non-lambda-count states, the term range and the checked range are the same. However, for count states (those involving lambda transitions), the term range and checked range will differ because the checked range denotes the range where transitions to the next state may occur, and that depends on the state count. The use of state ranges is a mechanism to avoid rechecking CC event matches.

With respect to lambda transitions (recall that a Lambda1 transition is taken only if the count of the current state is equal to the lambda threshold value), consider the regex: <<regex=^.{5}>>. The checked range for the CC-loop state corresponding to the dot character class would normally extend over the entire subject, since the dot is generally used in regex notation to denote “any character”. The nature of the Lambda1 function limits this range, however. The next state can be entered only at position 5 in the subject, but the checked range of the loop state is defined as [0,0], since the validity of the Lambda1 function has only been verified at this position. The cursor for the next state is set to position 5, because the 5 characters were ‘consumed’ by the Lambda1 transition. In this sense, it behaves similarly to a string event.

Lambda2 state ranges are calculated and checked in exactly the same way as Lambda1 ranges. However, when transitioning to the next state, Lambda2 transitions do not consume characters in the subject up to the lambda threshold count value. The cursor for the next state is set to the cursor of the current state, thus not ‘consuming’ any characters in the subject, instead of <<cursor+lambda threshold>>, as is the case with Lambda1 transitions.

With respect to regexes that include start anchors, these are handled by setting the checked range of the initial state (D0) to be [0,subjectLength] (which generally would be the checked range of the initial state whether the regex in question begins with a start anchor or not), and processing an anchor transition only if the cursor is 0. When entering the next state after the anchor, the checked range is set to [0,0]. If there are other transitions from the initial state, the checked range is computed in the usual way. This allows processing of regexes like <<(^|[0-9]+)dog>>. A state is said to be anchored if its checked range is a single position. Often this is determined by the range of the previous state and the event that caused the transition to this state.

Additional processing advantages may be obtained using adopted ranges, where the checked range of a state is based on the checked range of a previous state. Consider the example regex: <<regex=.*[0-9]555>> as shown in FIG. 6A. In this case, state D3 has a range that extends to the end of the subject because it has a ‘dot’ (i.e. any character; in general the “.” is shorthand in regex processing for a character class that matches anything) loop event. If the subject: subject=abcd123456781234555678efgh is processed, a CC match that allows a transition to state D1 is found at position 4. Note that a transition to state D1 can also occur at any position up to 21.

To avoid taking the transition to state D1 multiple times in cases like this, the present methods and systems use the concept of checked ranges, and more specifically in this case, an adopted range, extending the checked range of some states based on the checked range of their previous state. This can be done in states that do not have a CC-loop event. The calculation of the end of the current state's checked range may be performed, using the previous state's checked range, as follows (where “CC” pertains to the character-class ([0-9]) that is associated with the transition from the previous state (D3) to the current state (D1), that transition also being known as the “ingress event”):

checkedEnd = cursor;
while ((CC matches at checkedEnd) &&
       (checkedEnd < previous_state→checkedEnd)) ++checkedEnd;

The result of this in the present example is that the checked range for state D1 is [5,22], and the search for the string term “555” can be done very quickly, finding a match at position 19 in the subject. Thus, states following CC-loop states can sometimes adopt a range based on the previous state's checked range.
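The adopted-range calculation for this example can be checked with the following sketch. The function name `adopted_range` is hypothetical, and the previous state's checked end is assumed to equal the subject length, since D3's range is described as extending to the end of the subject.

```python
def adopted_range(subject, cursor, in_class, prev_checked_end):
    """Extend the current state's checked range while the ingress character
    class keeps matching, without passing the previous state's checked end."""
    checked_end = cursor
    while (checked_end < len(subject)
           and checked_end < prev_checked_end
           and in_class(subject[checked_end])):
        checked_end += 1
    return cursor, checked_end


def is_digit(ch):
    return ch in "0123456789"


subject = "abcd123456781234555678efgh"
# The first digit is at position 4, so D1 is entered with the cursor at 5.
d1_range = adopted_range(subject, 5, is_digit, len(subject))

# Search for the string term with its first character inside D1's checked range.
keyword = "555"
start_of_match = subject.find(keyword, d1_range[0], d1_range[1] + len(keyword))
cursor_after_match = start_of_match + len(keyword)
```

Under zero-based indexing, `str.find` reports the first character of “555” at index 16, and the cursor just past the matched string is 19.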

Similarly, states following CC-loop-count states (i.e. states that are transitioned into using lambda transitions) can adopt a range based on the previous state, but care must be taken to account for characters consumed by the count function. For example, consider this regex: regex=^A+[A1-4]{5}B, shown in FIG. 6B. The range of D4 is determined by the range of D3. As such, we can compute the range of D4 as follows:

D4.start=D3.start+lambda_count−1;

D4.end=minimum(D3.end+lambda_count−1, subjectLength);

There is an offset of −1 when computing these values to account for the character that was consumed on entering D3. There is a similar formula to compute the range of a state following a Lambda2 transition:

Dn.start=cursor;

Dn.end=MIN(cursor+Lcount−1, prev→end);
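The two range formulas above can be transcribed directly into code. This is a minimal sketch; the function names are hypothetical, and the numeric inputs in the usage lines are made-up values chosen only to show the arithmetic, not values taken from FIG. 6B.

```python
def range_after_lambda1(prev_start, prev_end, lambda_count, subject_length):
    """Range of the state entered by a Lambda1 transition.

    The -1 offset accounts for the character consumed on entering
    the counting state.
    """
    start = prev_start + lambda_count - 1
    end = min(prev_end + lambda_count - 1, subject_length)
    return start, end


def range_after_lambda2(cursor, lambda_count, prev_end):
    """Range of the state entered by a Lambda2 transition."""
    return cursor, min(cursor + lambda_count - 1, prev_end)


# Hypothetical inputs: previous range [1,3], lambda count 5, subject length 20.
example_lambda1 = range_after_lambda1(1, 3, 5, 20)
# Hypothetical inputs: cursor 4, lambda count 3, previous range ending at 10.
example_lambda2 = range_after_lambda2(4, 3, 10)
```

In both functions the `min` call keeps the computed range from running past its bound: the subject length for Lambda1, and the previous state's range end for Lambda2.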

5. Incremental Operation

One additional aspect of the present methods and systems is the ability to save the state of a search and resume searching as more data becomes available, such as when an additional packet arrives that includes a next part of a subject to be analyzed, as may often be the case when analyzing a flow of Transmission Control Protocol (TCP) data. Because the methods and systems described herein are not a pure DFA implementation, this pausing and resumption of processing is not as simple as remembering the last state. Because there may be multiple attempted paths through the state machine depending on the subject, it is possible that the end of the available data can be reached in multiple states. In accordance with the present methods and systems, only as much information as is needed, which often will be less than the entirety of the data that has arrived up to that point in the processing, is retained for resumption of processing upon arrival of additional data.

Every state in a graph has a characteristic value referred to as the state maximum tail that defines the maximum number of characters at the end of a portion of a subject that need to be saved to allow restarting of processing. This value depends mainly on the possible transitions out of the state. If a state has one or more string transitions out, the state maximum tail may be determined by the length of the longest string event among those string transitions. For example, if a state has an egress transition associated with a string that is 10 characters long, it cannot be guaranteed that a string match does not start 7 characters before the end of the subject.

If a state has a CC-loop event and a lambda count egress transition, then the state maximum tail may be equal to the lambda count. This is similar conceptually to the case of a string match, because the count condition may evaluate true at some point past the boundary between the available data and the data that is yet to arrive.

Difficulties may arise if a state has a CC-loop event and it has at least one CC egress transition. CC-loop states define their own range by determining how long the CC-loop event matches from the cursor looking forward in the subject. As described herein, states with a CC-match ingress event, and no CC-loop event, derive their state range from the previous state's checked range. An adopted range preferably does not extend beyond the checked end of the previous state. This is not possible in the case of a restart because, as described below, there is no previous state to which to refer.

The solution is to rely on the fact that the CC-loop state computes its range independently of previous states. For each CC-loop state, the system computes the maximum depth of possible paths until either another CC-loop state or an accepting state is reached. This determines the maximum number of states that can derive their range from the CC-loop state. Noting the fact that every CC transition to another state consumes one character, if the search starts that number (the state maximum tail for this type of state) of characters back from the boundary, the system will reach the maximum depth state at the boundary between the already-received data and the later-received data. This means that no potential paths through the state machine will be missed.

The actual tail length of a state is determined at the time of the search. When it is determined that a state should be restarted when new data arrives (because a match/no match determination cannot be made), the system takes the maximum of the three possible tail lengths described above. It then computes the distance from the state entry (i.e. the cursor) to the end of the current data (referred to herein as the state range tail). The actual tail length required to guarantee accurate searching on restart is the lesser of the state range tail and the state maximum tail. Thus, a number of characters will be saved at the end of the currently-evaluated portion of the subject, and that number will be at least the actual tail length, and perhaps more, depending on the restart information saved in other states. The paused state will then fail back to its previous state after saving its restart information, and will convey its calculated actual tail length to the previous state.

The restart information also preserves the distance from the end of the paused state's checked range to the end of the currently-available data. Often this value will be 0, if the checked range extends to the end of the data. Sometimes, however, the checked range stops short of the end of data, but the search result cannot be determined (e.g. if waiting for a possible string match). Saving this value lets the checked range be set appropriately on restart and prevents possible erroneous matches.

For example, assume that a state has a state maximum tail of 10 characters based on a string egress transition. If the state is entered 5 characters before the end of the current data, the system should not restart 10 characters before the boundary between the already-received data and the later-received data, because that could lead to an erroneous match. On the other hand, if the state is entered 15 characters from the end, and the checked range extends to 5 characters before the end, the restart range would be set to (−10,−5) (expressed in character positions, relative to the old/new data boundary), because the state's maximum tail is 10.
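The tail-length arithmetic in the preceding paragraphs can be illustrated with a short C++ sketch. The `RestartRange` type and the `computeRestartRange` function are hypothetical names introduced here for illustration, and the sketch assumes that the cursor and the checked end are expressed as offsets from the start of the currently-available data.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical restart record. Both values are expressed relative to the
// old/new data boundary (position 0), so they are zero or negative.
struct RestartRange {
    int start;  // minus the actual tail length
    int end;    // minus the distance from the checked end to the end of data
};

// stateMaxTail: the state's characteristic maximum tail.
// cursor:       position at which the state was entered.
// checkedEnd:   end of the state's checked range.
// dataLength:   number of characters currently available.
RestartRange computeRestartRange(int stateMaxTail, int cursor,
                                 int checkedEnd, int dataLength) {
    // Distance from the state entry to the end of the current data.
    int stateRangeTail = dataLength - cursor;
    // The actual tail is the lesser of the two, so a restart never begins
    // before the position at which the state was entered.
    int actualTail = std::min(stateRangeTail, stateMaxTail);
    // Distance from the checked end to the end of the available data.
    int checkedGap = dataLength - checkedEnd;
    return RestartRange{-actualTail, -checkedGap};
}
```

With 100 characters available, a state maximum tail of 10, entry at position 85, and a checked end at position 95, the sketch returns (−10,−5), which agrees with the example in the text. Entry at position 95 instead gives a start of −5, not −10.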

FIG. 7A graphically depicts the above-described concept of each state having a characteristic tail (state maximum tail), which defines a range at the end of the subject. Within that range, a full decision on transitions cannot be made because more data is required. The tails are determined by the terms in the regex and are computed at compile time. In this case, states D0, D4, and D5 have checked ranges that overlap the state tails, as shown by the white rectangles in FIG. 7B, where the darker rectangles are the same state maximum tails from FIG. 7A. These three states are thus candidates for restart, in other words for incremental operation. These three states are restarted when new data arrives as shown in FIG. 7C. Note that, in all three figures, the vertical line represents the boundary between the data that was available prior to pausing the processing (to the left of the vertical line) and the data that became available later, causing processing to restart (to the right of the vertical line).

6. Handling of String Terms in Regexes

Some preferred embodiments utilize keyword trees (KT) to preprocess a subject. A keyword tree is a data structure used to quickly locate fixed strings in a longer string (e.g., the subject being inspected). Using a KT, a string match at a particular location may be treated as a single event that drives the state machine. That is, because the locations of all matches of fixed strings from the regex are known prior to processing the subject using the state machine, the regex parsing treats strings as single terms that drive transitions from one state to another.

In one preferred embodiment, all of the string matches that occur in the subject (i.e. strings from the regex that are found in the subject during a KT preprocessing step that occurs prior to using the state machine generated from the regex to process the subject), sorted by the starting position, are provided in an array. Sorting the KT matches is preferably done in the keyword tree itself, and does not significantly impact performance of the KT. Matches only need to be sorted when there is a longer string in the tree that includes a shorter substring. The longer string starts first, but the KT does not recognize the longer string until after the short one has been located and included in the result set. In this case the substring is shifted in the results array and the longer one is inserted in its place.

Preferably, the KT search has the ability to handle both case-sensitive and case-insensitive searches. In the KT, each character of the keywords is modeled as a node on a branch. If the keywords are added to the tree to support case insensitivity, then for a string of length N, all permutations of case must be added (2^N keywords). Instead of adding the permutations of keywords to the KT, the KT search function was modified to look for case-insensitive paths. At every node the search function checks to see if there is a child node for the next character C. If there is not, it checks for the case complement C′.
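The case-complement lookup described above might be sketched in C++ as follows. The `KtNode` structure and the functions `addKeyword`, `caseComplement`, and `matchAt` are hypothetical names chosen for this illustration, and the sketch assumes that each keyword is stored once in the tree, in a single case.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical keyword-tree node: one child per character.
struct KtNode {
    std::map<char, std::unique_ptr<KtNode>> children;
    bool isKeywordEnd = false;  // true if a keyword ends at this node
};

// Adds one keyword to the tree exactly as written (no case permutations).
void addKeyword(KtNode& root, const std::string& keyword) {
    KtNode* node = &root;
    for (char c : keyword) {
        std::unique_ptr<KtNode>& child = node->children[c];
        if (!child) child.reset(new KtNode());
        node = child.get();
    }
    node->isKeywordEnd = true;
}

// Case complement C' of character C; non-letters map to themselves.
char caseComplement(char c) {
    unsigned char u = static_cast<unsigned char>(c);
    if (std::islower(u)) return static_cast<char>(std::toupper(u));
    if (std::isupper(u)) return static_cast<char>(std::tolower(u));
    return c;
}

// Returns the length of the longest keyword that matches at subject[pos],
// ignoring case, or 0 if none matches. At each node the search tries the
// child for C first and falls back to the child for C'.
std::size_t matchAt(const KtNode& root, const std::string& subject,
                    std::size_t pos) {
    const KtNode* node = &root;
    std::size_t longest = 0;
    for (std::size_t i = pos; i < subject.size(); ++i) {
        auto it = node->children.find(subject[i]);
        if (it == node->children.end())
            it = node->children.find(caseComplement(subject[i]));
        if (it == node->children.end()) break;
        node = it->second.get();
        if (node->isKeywordEnd) longest = i - pos + 1;
    }
    return longest;
}
```

Storing the keyword "select" once is enough for this sketch to match "SeLeCT" in a subject, which avoids adding 2^N case permutations to the tree.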

The result is that the KT search returns all case-insensitive keyword matches. In situations where a case-sensitive match is important, a check can be made to confirm the case match. This check only needs to be performed if the regex search has reached a point where a case-sensitive string match can cause a state transition, and the check is basically a compare over the length of the string at the location of the match in the subject. The rule base used to process the subject is shown in FIG. 8A.

As can be seen in FIG. 8A, the fixed strings in the regex are identified, and the subject is preprocessed using a KT, and the results of that KT are added to the rule base, such that the presence and location in the subject of the strings from the regex are known and stored in the rule base prior to processing the subject using the state machine generated from the regex. Furthermore, the character classes in the regex are identified and encoded as bit-mask byte words, as described above. Finally, the state machine is generated from the regex as described herein, and stored in the rule base as a state-transition table, essentially with an entry for each state, where each entry includes the parameters described herein, along with information as to that state's egress transitions.

In a further embodiment, a traditional string search is utilized. One such search methodology is a Boyer-Moore (BM) search algorithm, which is an efficient and widely used way of finding strings in a subject. The bulk of the logic for evaluating regexes is substantially the same. The primary difference is in how string transitions are processed. Instead of iterating through a set of KT string matches looking for a particular string, a BM search in the subject is performed from the current position to the end of the state's checked range. Any matches found cause the string transition to be taken. The change from KT to BM changes what gets compiled into the rule base. Because the Boyer-Moore algorithm preprocesses the search string into information that can be saved for expediting the real-time BM searching in the subject, the strings from the regex are stored in the rule base, as shown in FIG. 8B.
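As a rough illustration of a range-limited search with precomputed pattern data, the following C++ sketch uses the Horspool simplification of Boyer-Moore (bad-character shifts only), not the full algorithm. The names `BmPattern`, `compilePattern`, and `searchInRange` are hypothetical, and the search range is assumed to be half-open, from `from` up to but not including `to`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical per-pattern data, computed once at compile time and kept
// with the pattern, in the spirit of the stored BM pattern information.
struct BmPattern {
    std::string text;
    std::array<std::size_t, 256> shift;  // bad-character shift table
};

BmPattern compilePattern(const std::string& text) {
    BmPattern p;
    p.text = text;
    p.shift.fill(text.size());
    // Every character except the last gets its distance from the end.
    for (std::size_t i = 0; i + 1 < text.size(); ++i)
        p.shift[static_cast<unsigned char>(text[i])] = text.size() - 1 - i;
    return p;
}

// Searches subject[from, to) and returns the start of the first match,
// or std::string::npos if the pattern does not occur in that range.
std::size_t searchInRange(const BmPattern& p, const std::string& subject,
                          std::size_t from, std::size_t to) {
    const std::size_t m = p.text.size();
    if (to > subject.size()) to = subject.size();
    if (m == 0 || from > to || to - from < m) return std::string::npos;
    std::size_t pos = from;
    while (pos + m <= to) {
        // Compare the window right to left.
        std::size_t i = m;
        while (i > 0 && subject[pos + i - 1] == p.text[i - 1]) --i;
        if (i == 0) return pos;
        // Shift by the table entry for the window's last character.
        pos += p.shift[static_cast<unsigned char>(subject[pos + m - 1])];
    }
    return std::string::npos;
}
```

Limiting the search to the state's checked range means a match outside that range does not trigger the string transition, which is the behavior the paragraph above describes.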

In a further alternative embodiment, a hybrid of the KT and BM methods described above may be used. In some situations, the KT produces numerous hits for 2-character, 3-character, and 4-character strings, for example. It also produces many hits for strings with a single repeated character. For example, if the KT includes strings ‘00’, ‘000’, ‘0000’, and the subject includes a long block of 0s, then the KT generates three hits for each position in the subject. In tests, it was not unusual to get 500-1000 KT hits for a 2000-byte subject, causing degraded performance.

Thus, Boyer-Moore is preferably used to analyze strings such as short or uniform strings, or other cases that are not handled efficiently by the KT. Preferably, at compile time, the string transitions are flagged to specify whether they are to be handled by BM or KT treatment at search time. This changes the rule base to look like that shown in FIG. 8C, where the rule base contains KT results for strings that will be handled by the KT process described herein, as well as encoded strings in the BM pattern information that will be used in real-time when performing a BM search, along with the CC table and state-transition table, as described herein.
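One way the compile-time flagging could look is sketched below in C++. The `StringHandler` enumeration and `chooseHandler` function are hypothetical, and the length threshold is a caller-supplied assumption; the description does not fix a particular value. The rule used here sends a string to the keyword tree only if it has at least two distinct characters and is longer than the threshold, and sends everything else to Boyer-Moore.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical flag attached to each string transition at compile time.
enum class StringHandler { KeywordTree, BoyerMoore };

// Short strings and uniform strings (a single repeated character) are
// routed to Boyer-Moore; all other strings go to the keyword tree.
StringHandler chooseHandler(const std::string& term,
                            std::size_t lengthThreshold) {
    std::set<char> distinct(term.begin(), term.end());
    bool longEnough = term.size() > lengthThreshold;
    bool varied = distinct.size() >= 2;
    return (longEnough && varied) ? StringHandler::KeywordTree
                                  : StringHandler::BoyerMoore;
}
```

Under an assumed threshold of 4, the strings ‘00’, ‘000’, and ‘0000’ from the example would all be flagged for Boyer-Moore, keeping them out of the keyword tree where they generate many hits.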

7. Object Model

In accordance with the present methods and systems, an object-oriented programming approach may be used, in accordance with which certain programming objects may be implemented. As an example, once the state machine has been generated from the regex, that state machine may be stored in memory or other data storage in the form of a state-transition table. As such, each state in the state machine may be implemented as an instance of a programming object referred to herein as a “State,” while a “State Graph” programming object may include a pointer to the initial State (State *initialState) in the machine, along with an integer value corresponding to the number of States in the machine (int stateCount). Conceptually, the State Graph may be thought of as representing the entire state machine, and essentially is a ‘container’ for the state machine. Thus, the State Graph object may have a structure similar to that shown in the following table.

State Graph

-   // Pointer to the Initial State in the State Machine
    State *initialState
-   // Count of the Number of States in the State Machine
    int stateCount

As described above, each State in the State Graph may be implemented as an instance of the State object. Each State may have a pointer to a linked list of one or more Transition objects (Transition *transitions), representing the one or more egress events to other States in the State Graph. Furthermore, each State may have a Boolean variable that indicates whether that State is an accepting state (i.e. match state) or not (bool isAcceptingState). Each State object may further include an integer variable representing the lambda count for the state, in the case that the State has a lambda egress event (int maxLambdaCount). Furthermore, each State may have integer variables representing the State's maximum tail length (int maxTailLength) and maximum transition event length (int maxTransitionEventLength), which would be used in the event that the State needed to be paused with saved restart information, as described above. These variables help determine how much data must be kept to be able to restart a search when more data arrives.

Furthermore, each State object may also have certain dynamic search-time values. For example, each State may have a start and end value for a term range (int termRangeStart and int termRangeEnd), along with start and end values for a checked range (int checkedRangeStart and int checkedRangeEnd), to be used in calculating and storing the state ranges described herein. In addition, each state may have an integer state count variable (int count), for use in storing the current number of times that the State has been transitioned into, for use in making lambda-transition determinations, as described herein. As such, a State programming object may have a structure as follows.

State

-   // Transitions connect States and define the structure of the state machine
    Transition *transitions
-   // Static Parameters of the State
-   bool isAcceptingState
-   int maxLambdaCount
-   // Restart Information
-   int maxTailLength
-   int maxTransitionEventLength
-   // Dynamic Search-Time Data
-   int termRangeStart
-   int termRangeEnd
-   int checkedRangeStart
-   int checkedRangeEnd
-   int count

As shown above, each State object contains a transitions pointer to a linked list of Transition objects. In accordance with the present methods and systems, the Transition object represents transitions between States, and defines the structure of the State Graph (i.e. the state machine). Each instance of the Transition object contains another programming object called an Event (Event event), explained below. And each instance of the Transition object further includes a pointer (State *nextState) to a State object representing the next state in the state machine (i.e. State Graph, state-transition table, etc.). Thus, the Transition object may have a structure as follows.

Transition

-   // An event associated with the particular transition
    Event event
-   // A pointer to the next state, thus defining the structure of the State Graph
    State *nextState

As referenced above, each Transition in the object model has an associated Event. As further explained herein, there are different types of transitions in accordance with the present systems and methods. As such, there are different types of Events. The Event object includes a value corresponding to an event type (EventType type), which may be implemented in the software as an enumerated type, along with an integer parameter used as an event identifier (int eventId).

The EventType may be one of the following: character class, string, anchor, Lambda1, or Lambda2, corresponding to different types of transitions described herein. The eventId takes on different meanings depending upon the value of the event type. If the event is a character class or string, the eventId may correspond to an element in a stored CC table or keyword tree, as described herein. If the event is an anchor, the eventId may indicate whether the anchor is a start anchor or an end anchor. Finally, if the event is a Lambda1 or Lambda2, the eventId may represent the lambda threshold associated with the transition, for comparison with the running state count in evaluating the lambda threshold condition. Thus, the Event object may have a structure as follows.

Event

-   // Indicates whether the event is CC, String, Anchor, Lambda1, or Lambda2
    EventType type
-   // Takes on different significance depending on the type of event
    int eventId
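Taken together, the four objects described in this section could be declared in C++ roughly as follows. The member names follow the listings above; the `next` pointer used to chain Transition objects into a linked list, the default initial values, and the helper `countTransitions` are assumptions added for this sketch.

```cpp
#include <cassert>

enum class EventType { CharacterClass, String, Anchor, Lambda1, Lambda2 };

struct Event {
    EventType type;  // CC, String, Anchor, Lambda1, or Lambda2
    int eventId;     // meaning depends on the event type
};

struct State;  // forward declaration, needed by Transition

struct Transition {
    Event event;       // event associated with this transition
    State* nextState;  // state reached when the event occurs
    Transition* next;  // assumed link to the next Transition in the list
};

struct State {
    Transition* transitions = nullptr;
    // Static parameters of the State
    bool isAcceptingState = false;
    int maxLambdaCount = 0;
    // Restart information
    int maxTailLength = 0;
    int maxTransitionEventLength = 0;
    // Dynamic search-time data
    int termRangeStart = 0;
    int termRangeEnd = 0;
    int checkedRangeStart = 0;
    int checkedRangeEnd = 0;
    int count = 0;
};

struct StateGraph {
    State* initialState = nullptr;
    int stateCount = 0;
};

// Illustrative helper: walks a State's linked list of egress transitions.
int countTransitions(const State& state) {
    int n = 0;
    for (const Transition* t = state.transitions; t != nullptr; t = t->next)
        ++n;
    return n;
}
```

In this sketch a two-state machine with a single string transition is built by creating two State objects, one Transition whose nextState points at the accepting State, and a StateGraph whose initialState points at the first State.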

8. Applications

As described herein, the present methods and systems relate to processing data subjects to identify therein the presence of predetermined data patterns, such as can be expressed using regular expressions. As also described herein, one application in which the present methods and systems may be applied is in intrusion-prevention and intrusion-detection systems. In that context, once a particular subject (e.g. payload of one or more packets) has been identified as matching a particular regex (e.g. an attack signature), a number of different actions may be taken with respect to that packet or those packets, such as quarantining them, discarding them, deleting them, saving them, notifying a user (e.g. a network administrator) in real-time and/or in a stored report, and/or any other suitable action.

A somewhat related context in which the present methods and systems may be applied is scanning a particular file, volume, directory, disk, etc. on a particular computer or group of computers. Thus, in addition to scanning flowing network traffic, data that is not being transferred at the time can be scanned as well for particular signature patterns that may indicate a virus, spyware, and/or any other threat. In this context, upon detecting a match, some similar actions may be taken with respect to particular files, directories, etc., such as quarantining, deleting, notifying a user of the event, prompting a user to determine a next action, and/or any other suitable manner of dealing with an identified actual or potential threat.

Another context in which the present methods and systems may prove useful is the context often known as “extrusion prevention,” which relates to attempting to prevent certain types of information from being transferred from a particular computer, group of computers, network, or server. This may include preventing extrusions caused by hackers, for example, as well as extrusions sent from within, such as by a disloyal employee, for example. Upon detection of such an extrusion, a particular transfer or flow could be stopped, the person or persons causing the extrusion could be identified in a report, a network administrator could be alerted, etc.

An additional context in which the present methods and systems may be applied is the context often known as deep packet inspection (DPI), which generally refers to examining packets at higher layers of the Open Systems Interconnection (OSI) reference model, such as the application layer, the presentation layer, the session layer, etc. DPI is often done in real-time by a device such as a network switch or router. The particular patterns being searched for may be related to the type of data, the application being used by an end user, a type of data such as streaming video, a particular source such as a particular website, etc.

Upon detection of a match, any number of responsive actions could be taken, including logging a particular flow (also known as “packet capture”) to save for later inspection, directing a flow to a particular decoder or processing engine as appropriate, intercepting web-based e-mail associated with a particular provider, extracting content, identifying packet types or users, identifying applications (peer to peer, VoIP), throttling (i.e. rate limiting) the data packets (QOS), monitoring outbound traffic, and/or any other action. An additional action that may be taken upon identifying a matching pattern in a packet or set of packets may be to route a copy of one or more packets to an authorized law-enforcement agency.

Another context in which the present methods and systems may be applied is the context of searching a particular database for particular data patterns, perhaps as requested by users of the database. As an example, an Internet search engine may use the present methods and systems to improve searching for matches to regular expressions (which may be generated from search strings or provided directly by users). Upon finding a match, a list of matching documents may be presented to a user. More generally in the context of database searching, the present methods and systems may be used for data mining, to sift through a database and perhaps highlight correlations of significance, create reports indicating data-mining results, etc.

Another context in which the present methods and systems may be applied is the context of a text editor and/or word processor. Upon finding a match in a document or set of documents, matches may be highlighted for a user, who may then be able to iterate through the matches using a “next match” command on a user interface of the text editor or word processor. Furthermore, a report of matches may be produced for a user. As another option, a search-and-replace operation may be carried out, thereby changing the contents of one or more documents according to a set of matching locations identified using the present methods and systems.

As another example, in the context of bioinformatics, the present methods and systems may be used to search for particular gene markers in DNA, where the gene markers are set forth in a regular expression. As in other contexts, a list of matching locations could be presented to a user for further processing.

9. Exemplary Operation

a. A First Exemplary Method

FIG. 9 depicts an exemplary method 900, which may be carried out in an intrusion-prevention system for examining network traffic and identifying therein the presence of signature data patterns. In accordance with the method, at 902, a state-transition table is provided, said table representative of a predetermined data pattern, and comprising a plurality of states, each state having a set of egress events, each egress event defining a transition from a current state to a next state. The state-transition table may be representative of a state diagram that itself is representative of the predetermined data pattern. The predetermined data pattern may be representative of a regular expression. Furthermore, each egress event may be either a character class or a character string.

At 904, the predetermined data pattern is parsed to identify a set of character strings therein. The identified set of character strings in the predetermined data pattern may consist of those character strings that (a) include at least two distinct characters and (b) have a string length that is greater than a threshold number.

Further in accordance with the exemplary method, at 906, a subject is received, where the subject is to be evaluated for the presence of the predetermined data pattern. The subject is preprocessed to find therein any instances of the character strings identified in the predetermined data pattern. Preprocessing the subject may involve using a keyword-tree search.

At 908, a keyword table is then populated with a subset of the identified character strings, the subset consisting of those character strings found in the subject during preprocessing. The subject may comprise a payload of one or more packets, and the presence of the predetermined data pattern may be indicative of a potential security threat.

At 910, while using the state-transition table to evaluate the subject for the presence of the predetermined data pattern, a first state is transitioned into, where the first state has a first one of the identified character strings as a first egress event thereof. The first egress event defines a transition from the first state to a second state.

At 912, responsive to transitioning into the first state, the keyword table is checked for the first character string. Responsive to finding the first character string in the keyword table, the transition is taken from the first state to the second state. Transitioning from one state to another may involve recursively calling a state-search function, and that function may be implemented in a manner similar to the following pseudocode.

Status searchFunction( prevState, ingressEvent, currentState, subject, subjectLen, cursor )
{
  if( currentState->isAcceptingState ) return MATCH;
  if( can't proceed )
  {
    // At end of subject so nothing left to search?
    // State count > lambda condition so transitions out no longer possible?
    return NO_MATCH;
  }
  currentState->count++;
  computeCheckedRange( prevState, ingressEvent, currentState, subject, subjectLen, cursor );
  for( each transition from currentState )
  {
    nextState = transition->nextState;
    if( transition->event->type == STRING_MATCH_EVENT )
    {
      for( each string match in currentState's checked range )
      {
        // Find position after string match
        nextCursor = stringMatchPosition + stringLen;
        result = searchFunction( currentState, transition->event, nextState, subject, subjectLen, nextCursor );
        if( result == MATCH ) return MATCH;
      }
    }
    else if( transition->event->type == CC_MATCH_EVENT )
    {
      for( index is in checked range, starting at cursor )
      {
        if( subject[index] matches transition->event->CC )
        {
          // A CC transition consumes one character
          nextCursor = index + 1;
          result = searchFunction( currentState, transition->event, nextState, subject, subjectLen, nextCursor );
          if( result == MATCH ) return MATCH;
          // No need to try nextState again until after its checked end
          index = nextState->checkedRangeEnd;
        }
        else
        {
          // Try the next position
          index++;
        }
      }
    }
  }
  currentState->count--;
  return NO_MATCH;
}

Preprocessing the subject may involve identifying positions in the subject where the instances of the identified character strings are located, and the keyword table may be populated with the identified positions. Furthermore, a first-state range may be calculated, where the first-state range is a range of positions in the subject in which to search for the presence of at least one of the first state's egress events. As such, checking the keyword table for the first character string may involve checking the keyword table for an instance of the first character string at a position within the first-state range. Also, finding the first character string in the keyword table may involve finding in the keyword table an instance of the first character string at a position within the first-state range.
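A range-limited keyword-table check of this kind might be sketched in C++ as follows. The `KeywordHit` record and `findInRange` function are hypothetical names; the sketch assumes that the table is sorted by starting position, that each string is identified by an integer, and that the first-state range is inclusive at both ends.

```cpp
#include <cassert>
#include <vector>

// Hypothetical keyword-table entry produced by preprocessing the subject.
struct KeywordHit {
    int stringId;  // which identified character string was found
    int position;  // starting position of the instance in the subject
};

// Returns the position of the first instance of stringId whose position
// lies within [rangeStart, rangeEnd], or -1 if there is no such instance.
int findInRange(const std::vector<KeywordHit>& table, int stringId,
                int rangeStart, int rangeEnd) {
    for (const KeywordHit& hit : table) {
        // The table is sorted by position, so later hits are out of range.
        if (hit.position > rangeEnd) break;
        if (hit.position >= rangeStart && hit.stringId == stringId)
            return hit.position;
    }
    return -1;
}
```

A hit that lies outside the first-state range is ignored, so an instance of the string earlier or later in the subject does not cause the transition to be taken from this state.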

A cursor may correspond to a location in the subject that is currently being evaluated. Furthermore, the transition into the first state may have been from a state referred to here as a previous state, according to an egress event referred to here as a previous-state egress event, and the previous state may have an associated previous-state range in the subject. Calculating the first-state range may involve setting a start of the first-state range equal to the cursor; and then, starting at the cursor, and extending no further than an end of the previous-state range, determining that the subject includes a number of consecutive instances of the previous-state egress event that end at a first position in the subject; and setting an end of the first-state range based on the first position. Furthermore, it may be determined that the first state does not have a character-class loop transition. And the previous-state range may be calculated.

Alternatively, calculating the first-state range may comprise determining that the first state has a character-class loop transition; setting a start of the first-state range equal to the cursor; and then, starting at the cursor, determining that the subject includes a number of consecutive characters that satisfy the character-class loop transition and that end at a first position in the subject; and setting an end of the first-state range based on the first position.
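The character-class-loop case can be illustrated with the following C++ sketch. The names `StateRange` and `ccLoopRange` are hypothetical, the character class is assumed to be represented as a 256-entry bit set, and the end of the range is assumed to be the inclusive position of the last character that satisfies the class; the description only says that the end is set based on that position.

```cpp
#include <bitset>
#include <cassert>
#include <string>

// Hypothetical inclusive range of positions in the subject.
struct StateRange {
    int start;
    int end;
};

// Starting at the cursor, scans forward over consecutive characters that
// satisfy the character class. If no character matches, the returned range
// is empty (end == start - 1).
StateRange ccLoopRange(const std::string& subject, int cursor,
                       const std::bitset<256>& characterClass) {
    int next = cursor;
    while (next < static_cast<int>(subject.size()) &&
           characterClass[static_cast<unsigned char>(subject[next])]) {
        ++next;
    }
    return StateRange{cursor, next - 1};
}
```

Because the scan depends only on the cursor and the subject, this computation needs no information from the previous state, unlike the adopted range in the preceding paragraph.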

b. A Second Exemplary Method

FIG. 10 depicts a second exemplary method, which may also be carried out in an intrusion-prevention system for examining network traffic and identifying therein the presence of signature data patterns. In accordance with the method, at 1002, a state-transition table is provided, where the table is representative of a predetermined data pattern, and includes states that each have a set of egress events that define transitions to next states. The state-transition table may be representative of a state diagram, where the state diagram is representative of the predetermined data pattern. The predetermined data pattern may be representative of a regular expression. Furthermore, each egress event may be either a character class or a character string.

Further in accordance with this embodiment, at 1004, a subject is received for evaluation for the presence of the predetermined data pattern. The subject may comprise a payload of one or more packets, and the presence of the predetermined data pattern may be indicative of a potential security threat.

At 1006, while using the state-transition table to evaluate the subject, a first state is transitioned into, where the first state has a first character string as a first egress event thereof, defining a transition from the first state to a second state.

At 1008, responsive to transitioning into the first state, a Boyer-Moore search is performed for the first character string in the subject. This search may be performed responsive to making either or both of the following determinations: (a) that the first character string does not include at least two distinct characters and (b) that the first character string has a string length that is less than a threshold number.

In some embodiments, a first-state range may be calculated, where the first-state range is a range of positions in the subject in which to search for the presence of at least one of the first state's egress events. As such, performing the Boyer-Moore search for the first character string in the subject may comprise performing the Boyer-Moore search for the first character string in the first-state range.

A cursor may correspond to a location in the subject that is currently being evaluated. Furthermore, transitioning into the first state may involve transitioning from a previous state into the first state according to a previous-state egress event, where the previous state has an associated previous-state range in the subject. As such, calculating the first-state range may comprise setting a start of the first-state range equal to the cursor; starting at the cursor, and extending no further than an end of the previous-state range, determining that the subject includes a number of consecutive instances of the previous-state egress event, the consecutive instances ending at a first position in the subject; and setting an end of the first-state range based on the first position. It may also be determined that the first state does not have a character-class loop transition. And the previous-state range may be calculated.

In other embodiments, calculating the first-state range may comprise determining that the first state has a character-class loop transition; setting a start of the first-state range equal to the cursor; starting at the cursor, determining that the subject includes a number of consecutive characters that satisfy the character-class loop transition, the consecutive characters ending at a first position in the subject; and setting an end of the first-state range based on the first position.

Further in accordance with the methods, upon the Boyer-Moore search determining that an instance of the first character string is present in the subject, the transition from the first state to the second state is responsively taken. Transitioning from one state to another may involve recursively calling a state-search function, which may be implemented in a manner similar to the pseudocode provided above in connection with method 900.

c. A Third Exemplary Method

FIG. 11 depicts a third exemplary method, which may also be carried out in an intrusion-prevention system for examining network traffic and identifying therein the presence of signature data patterns. In accordance with this method, at 1102, a state-transition table representative of a predetermined data pattern is provided, where the state-transition table comprises a plurality of states, each state having a set of egress events, each egress event defining a transition from a current state to a next state. The state-transition table may be representative of a state diagram, which itself may be representative of the predetermined data pattern. The predetermined data pattern may be representative of a regular expression. And each egress event may be either a character class or a character string.

Further in accordance with this method, at 1104, the predetermined data pattern is parsed to identify a set of a first type of character strings therein. The first type of character string may be defined by both (a) including at least two distinct characters and (b) having a string length greater than a threshold number.
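The two-part test for the first type of character string can be expressed compactly. The Python sketch below assumes the character strings have already been extracted from the pattern; the names `is_first_type` and `select_keywords`, and the threshold value used in the usage note, are hypothetical.

```python
from typing import Iterable, List


def is_first_type(s: bytes, threshold: int) -> bool:
    """True when s (a) contains at least two distinct characters and
    (b) is longer than the threshold."""
    return len(set(s)) >= 2 and len(s) > threshold


def select_keywords(strings: Iterable[bytes], threshold: int) -> List[bytes]:
    """Keep only the strings that qualify as the first type."""
    return [s for s in strings if is_first_type(s, threshold)]
```

With an assumed threshold of 3, `b"cmd.exe"` qualifies, while `b"aaaaaaa"` (a single distinct character) and `b"ab"` (too short) do not.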

Further in accordance with this method, at 1106, a subject is received for evaluation for the presence of the predetermined data pattern. The subject may comprise a payload of one or more packets, and the presence of the predetermined data pattern may be indicative of a potential security threat. The subject is preprocessed, perhaps using a keyword-tree search, to find therein any instances of the identified character strings.

Further in accordance with this method, at 1108, a keyword table is populated with a subset of the identified character strings, the subset consisting of those character strings found in the subject during preprocessing. Preprocessing the subject may involve identifying positions in the subject where the instances of the identified character strings are located, and populating the keyword table with the identified positions.
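One possible shape for such a keyword table is a mapping from each found character string to the sorted list of positions at which it occurs. The Python sketch below is an illustration only: it uses a simple repeated `bytes.find` scan in place of the keyword-tree search mentioned above, and the function name and table layout are assumptions.

```python
from typing import Dict, Iterable, List


def build_keyword_table(subject: bytes,
                        keywords: Iterable[bytes]) -> Dict[bytes, List[int]]:
    """Map each keyword that occurs in subject to its start positions,
    in ascending order. Keywords that never occur are left out."""
    table: Dict[bytes, List[int]] = {}
    for kw in keywords:
        if not kw:
            continue  # an empty keyword carries no information
        positions: List[int] = []
        pos = subject.find(kw)
        while pos != -1:
            positions.append(pos)
            pos = subject.find(kw, pos + 1)  # pos + 1 allows overlapping hits
        if positions:
            table[kw] = positions
    return table
```

Because only strings actually present in the subject get an entry, a later lookup that misses can be treated as "no instance in the subject" without rescanning the payload.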

Further in accordance with this method, at 1110, while using the state-transition table to evaluate the subject for the presence of the predetermined data pattern, a first state is transitioned into. The first state has a given character string as a first egress event thereof, where the first egress event defines a transition from the first state to a second state.

Further in accordance with this method, at 1112, responsive to transitioning into the first state, the subject is searched for an instance of the given character string, and, responsive to determining that there is an instance of the given character string in the subject, the transition from the first state to the second state is taken.

When the given character string is of the first type, searching the subject for an instance of the given character string comprises checking the keyword table for the given character string, and determining that there is an instance of the given character string in the subject comprises finding the given character string in the keyword table.

When the given character string is of a second type different from the first type, searching the subject for an instance of the given character string comprises performing a Boyer-Moore search for the given character string in the subject, and determining that there is an instance of the given character string in the subject comprises the Boyer-Moore search determining that an instance of the given character string is present in the subject. The second type of character string may be defined by either or both of (a) not including at least two distinct characters and (b) having a string length less than or equal to the threshold number.

In some embodiments, a first-state range may be calculated, where the first-state range is a range of positions in the subject in which to search for the presence of at least one of the first state's egress events. As such, checking the keyword table for the given character string may involve checking the keyword table for an instance of the given character string at a position within the first-state range. Moreover, finding the given character string in the keyword table may involve finding in the keyword table an instance of the given character string at a position within the first-state range. And performing the Boyer-Moore search for the given character string in the subject may involve performing the Boyer-Moore search for the given character string in the first-state range, while the Boyer-Moore search determining that an instance of the given character string is present in the subject may involve the Boyer-Moore search determining that an instance of the given character string is present in the first-state range.
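The hybrid dispatch described above, restricted to the first-state range, might look like the following Python sketch. It is illustrative only: the function name and parameters are hypothetical, the keyword table is assumed to map each found string to an ascending list of start positions, a match is assumed to have to lie wholly inside the half-open range, and Python's built-in `bytes.find` stands in for the Boyer-Moore search.

```python
from bisect import bisect_left
from typing import Dict, List


def find_egress_string(subject: bytes, keyword_table: Dict[bytes, List[int]],
                       given: bytes, range_start: int, range_end: int,
                       threshold: int) -> int:
    """Return a position of `given` inside [range_start, range_end),
    or -1 if no instance lies in that range."""
    first_type = len(set(given)) >= 2 and len(given) > threshold
    if first_type:
        # First type: consult the keyword table built during preprocessing.
        positions = keyword_table.get(given, [])
        i = bisect_left(positions, range_start)
        if i < len(positions) and positions[i] + len(given) <= range_end:
            return positions[i]
        return -1
    # Second type: search the range directly (stand-in for Boyer-Moore).
    return subject.find(given, range_start, range_end)
```

A non-negative return value would correspond to taking the transition from the first state to the second state; -1 would leave the transition untaken.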

A cursor may correspond to a location in the subject that is currently being evaluated. And transitioning into the first state may comprise transitioning from a previous state into the first state according to a previous-state egress event, where the previous state has an associated previous-state range in the subject. As such, calculating the first-state range may comprise setting a start of the first-state range equal to the cursor; starting at the cursor, and extending no further than an end of the previous-state range, determining that the subject includes a number of consecutive instances of the previous-state egress event, the consecutive instances ending at a first position in the subject; and setting an end of the first-state range based on the first position. It may be determined that the first state does not have a character-class loop transition. And the previous-state range may be calculated.

In other embodiments, calculating the first-state range may comprise determining that the first state has a character-class loop transition; setting a start of the first-state range equal to the cursor; starting at the cursor, determining that the subject includes a number of consecutive characters that satisfy the character-class loop transition, the consecutive characters ending at a first position in the subject; and setting an end of the first-state range based on the first position. In this embodiment as in all embodiments, transitioning from one state to another state may involve recursively calling a state-search function.

In another aspect, an exemplary embodiment may take the form of an intrusion-prevention network device for examining network traffic and identifying therein the presence of signature data patterns. The network device comprises a network interface, a processor, and data storage. The data storage comprises a state-transition table representative of a predetermined data pattern, the state-transition table comprising a plurality of states, each state having a set of egress events, each egress event defining a transition from a current state to a next state. The data storage further comprises instructions executable by the processor to carry out the methods described herein, in any suitable combinations and permutations.

Note as well that some or all of the variations described above with respect to the method embodiments may also apply to the network-device embodiments, in any suitable combinations and permutations.

The invention claimed is:
 1. In an intrusion-prevention system for examining network traffic and identifying therein the presence of signature data patterns, a method comprising: providing a state-transition table representative of a predetermined data pattern, the state-transition table comprising a plurality of states, each state having a set of egress events, each egress event defining a transition from a current state to a next state; parsing the predetermined data pattern to identify a set of character strings therein; receiving a subject to be evaluated for the presence of the predetermined data pattern, and preprocessing the subject to find therein any instances of the identified character strings; populating a keyword table with a subset of the identified character strings, the subset consisting of those character strings found in the subject during preprocessing; while using the state-transition table to evaluate the subject for a presence of the predetermined data pattern, transitioning into a first state having a first one of the identified character strings as a first egress event thereof, the first egress event defining a transition from the first state to a second state; and responsive to transitioning into the first state, checking, by a processing unit, the keyword table for the first character string, and, responsive to finding the first character string in the keyword table, transitioning, by the processing unit, from the first state to the second state.
 2. The method of claim 1, wherein the state-transition table is representative of a state diagram, the state diagram representative of the predetermined data pattern.
 3. The method of claim 1, wherein the predetermined data pattern is representative of a regular expression.
 4. The method of claim 1, wherein each egress event is either a character class or a character string.
 5. The method of claim 1, wherein the identified set of character strings in the predetermined data pattern consists of those character strings in the predetermined data pattern that (a) include at least two distinct characters and (b) have a string length that is greater than a threshold number.
 6. The method of claim 1, wherein the subject comprises a payload of one or more packets.
 7. The method of claim 1, wherein the presence of the predetermined data pattern is indicative of a potential security threat.
 8. The method of claim 1, wherein preprocessing the subject comprises using a keyword-tree search.
 9. The method of claim 1, wherein preprocessing the subject comprises identifying positions in the subject where the instances of the identified character strings are located, the method further comprising populating the keyword table with the identified positions.
 10. The method of claim 9, further comprising calculating a first-state range, the first-state range being a range of positions in the subject in which to search for the presence of at least one of the first state's egress events, wherein: checking the keyword table for the first character string comprises checking the keyword table for an instance of the first character string at a position within the first-state range; and finding the first character string in the keyword table comprises finding in the keyword table an instance of the first character string at a position within the first-state range.
 11. The method of claim 10, wherein a cursor corresponds to a location in the subject that is currently being evaluated.
 12. The method of claim 11, wherein transitioning into the first state comprises transitioning from a previous state into the first state according to a previous-state egress event, wherein the previous state has an associated previous-state range in the subject, and wherein calculating the first-state range comprises: setting a start of the first-state range equal to the cursor; starting at the cursor, and extending no further than an end of the previous-state range, determining that the subject includes a number of consecutive instances of the previous-state egress event, the consecutive instances ending at a first position in the subject; and setting an end of the first-state range based on the first position.
 13. The method of claim 12, further comprising determining that the first state does not have a character-class loop transition.
 14. The method of claim 12, further comprising calculating the previous-state range.
 15. The method of claim 11, wherein calculating the first-state range comprises: determining that the first state has a character-class loop transition; setting a start of the first-state range equal to the cursor; starting at the cursor, determining that the subject includes a number of consecutive characters that satisfy the character-class loop transition, the consecutive instances ending at a first position in the subject; and setting an end of the first-state range based on the first position.
 16. The method of claim 1, wherein transitioning from one state to another state comprises recursively calling a state-search function.
 17. An intrusion-prevention network device for examining network traffic and identifying therein the presence of signature data patterns, the network device comprising: a network interface; a processing unit; and data storage comprising: a state-transition table representative of a predetermined data pattern, the state-transition table comprising a plurality of states, each state having a set of egress events, each egress event defining a transition from a current state to a next state; and instructions executable by the processing unit to: parse the predetermined data pattern to identify a set of character strings therein; receive a subject to be evaluated for a presence of the predetermined data pattern, and preprocess the subject to find therein any instances of the identified character strings; populate a keyword table with a subset of the identified character strings, the subset consisting of those character strings found in the subject during preprocessing; while using the state-transition table to evaluate the subject for the presence of the predetermined data pattern, transition into a first state having a first one of the identified character strings as a first egress event thereof, the first egress event defining a transition from the first state to a second state; and responsive to transitioning into the first state, check the keyword table for the first character string, and, responsive to finding the first character string in the keyword table, transition from the first state to the second state.
 18. In an intrusion-prevention system for examining network traffic and identifying therein the presence of signature data patterns, a method comprising: providing a state-transition table representative of a predetermined data pattern, the state-transition table comprising a plurality of states, each state having a set of egress events, each egress event defining a transition from a current state to a next state; receiving a subject to be evaluated for a presence of the predetermined data pattern; while using the state-transition table to evaluate the subject for the presence of the predetermined data pattern, transitioning into a first state having a first character string as a first egress event thereof, the first egress event defining a transition from the first state to a second state; and responsive to transitioning into the first state, performing, by a processing unit, a Boyer-Moore search for the first character string in the subject, and, responsive to the Boyer-Moore search determining that an instance of the first character string is present in the subject, transitioning, by the processing unit, from the first state to the second state; calculating a first-state range, the first-state range being a range of positions in the subject in which to search for the presence of at least one of the first state's egress events, wherein performing the Boyer-Moore search for the first character string in the subject comprises performing the Boyer-Moore search for the first character string in the first-state range; wherein a cursor corresponds to a location in the subject that is currently being evaluated; wherein transitioning into the first state comprises transitioning from a previous state into the first state according to a previous-state egress event, wherein the previous state has an associated previous-state range in the subject, and wherein calculating the first-state range comprises: setting a start of the first-state range equal to the cursor; starting at the cursor, and extending no further than an end of the previous-state range, determining that the subject includes a number of consecutive instances of the previous-state egress event, the consecutive instances ending at a first position in the subject; and setting an end of the first-state range based on the first position.
 19. The method of claim 18, further comprising determining that the first state does not have a character-class loop transition.
 20. The method of claim 18, further comprising calculating the previous-state range.
 21. In an intrusion-prevention system for examining network traffic and identifying therein the presence of signature data patterns, a method comprising: providing a state-transition table representative of a predetermined data pattern, the state-transition table comprising a plurality of states, each state having a set of egress events, each egress event defining a transition from a current state to a next state; parsing the predetermined data pattern to identify a set of a first type of character strings therein; receiving a subject to be evaluated for a presence of the predetermined data pattern, and preprocessing the subject to find therein any instances of the identified character strings; populating a keyword table with a subset of the identified character strings, the subset consisting of those character strings found in the subject during preprocessing; while using the state-transition table to evaluate the subject for the presence of the predetermined data pattern, transitioning into a first state having a given character string as a first egress event thereof, the first egress event defining a transition from the first state to a second state; and responsive to transitioning into the first state, searching, by a processing unit, the subject for an instance of the given character string, and, responsive to determining that there is an instance of the given character string in the subject, transitioning from the first state to the second state, wherein, when the given character string is of the first type, searching, by the processing unit, the subject for an instance of the given character string comprises checking the keyword table for the given character string, and determining that there is an instance of the given character string in the subject comprises finding the given character string in the keyword table, and wherein, when the given character string is of a second type different from the first type, searching, by the processing unit, the subject for an instance of the given character string comprises performing a Boyer-Moore search for the given character string in the subject, and determining that there is an instance of the given character string in the subject comprises the Boyer-Moore search determining that an instance of the given character string is present in the subject.
 22. The method of claim 21, wherein the state-transition table is representative of a state diagram, the state diagram representative of the predetermined data pattern.
 23. The method of claim 21, wherein the predetermined data pattern is representative of a regular expression.
 24. The method of claim 21, wherein each egress event is either a character class or a character string.
 25. The method of claim 21, wherein: the first type of character string is defined by both (a) including at least two distinct characters and (b) having a string length greater than a threshold number, and the second type of character string is defined by either or both of (a) not including at least two distinct characters and (b) having a string length less than or equal to the threshold number.
 26. The method of claim 21, wherein the subject comprises a payload of one or more packets.
 27. The method of claim 21, wherein the presence of the predetermined data pattern is indicative of a potential security threat.
 28. The method of claim 21, wherein preprocessing the subject comprises using a keyword-tree search.
 29. The method of claim 21, wherein preprocessing the subject comprises identifying positions in the subject where the instances of the identified character strings are located, the method further comprising populating the keyword table with the identified positions.
 30. The method of claim 29, further comprising calculating a first-state range, the first-state range being a range of positions in the subject in which to search for the presence of at least one of the first state's egress events, wherein: checking the keyword table for the given character string comprises checking the keyword table for an instance of the given character string at a position within the first-state range; and finding the given character string in the keyword table comprises finding in the keyword table an instance of the given character string at a position within the first-state range; performing the Boyer-Moore search for the given character string in the subject comprises performing the Boyer-Moore search for the given character string in the first-state range; and the Boyer-Moore search determining that an instance of the given character string is present in the subject comprises the Boyer-Moore search determining that an instance of the given character string is present in the first-state range.
 31. The method of claim 30, wherein a cursor corresponds to a location in the subject that is currently being evaluated.
 32. The method of claim 31, wherein transitioning into the first state comprises transitioning from a previous state into the first state according to a previous-state egress event, wherein the previous state has an associated previous-state range in the subject, and wherein calculating the first-state range comprises: setting a start of the first-state range equal to the cursor; starting at the cursor, and extending no further than an end of the previous-state range, determining that the subject includes a number of consecutive instances of the previous-state egress event, the consecutive instances ending at a first position in the subject; and setting an end of the first-state range based on the first position.
 33. The method of claim 32, further comprising determining that the first state does not have a character-class loop transition.
 34. The method of claim 32, further comprising calculating the previous-state range.
 35. The method of claim 31, wherein calculating the first-state range comprises: determining that the first state has a character-class loop transition; setting a start of the first-state range equal to the cursor; starting at the cursor, determining that the subject includes a number of consecutive characters that satisfy the character-class loop transition, the consecutive instances ending at a first position in the subject; and setting an end of the first-state range based on the first position.
 36. The method of claim 21, wherein transitioning from one state to another state comprises recursively calling a state-search function.
 37. An intrusion-prevention network device for examining network traffic and identifying therein the presence of signature data patterns, the network device comprising: a network interface; a processing unit; and data storage comprising: a state-transition table representative of a predetermined data pattern, the state-transition table comprising a plurality of states, each state having a set of egress events, each egress event defining a transition from a current state to a next state; and instructions executable by the processing unit to: parse the predetermined data pattern to identify a set of a first type of character strings therein; receive a subject to be evaluated for a presence of the predetermined data pattern, and preprocess the subject to find therein any instances of the identified character strings; populate a keyword table with a subset of the identified character strings, the subset consisting of those character strings found in the subject during preprocessing; while using the state-transition table to evaluate the subject for the presence of the predetermined data pattern, transition into a first state having a given character string as a first egress event thereof, the first egress event defining a transition from the first state to a second state; and responsive to transitioning into the first state, search the subject for an instance of the given character string, and, responsive to determining that there is an instance of the given character string in the subject, transition from the first state to the second state, wherein, when the given character string is of the first type, the instructions to search the subject for an instance of the given character string comprise instructions to check the keyword table for the given character string, and the instructions to determine that there is an instance of the given character string in the subject comprise instructions to find the given character string in the keyword table, and wherein, when the given character string is of a second type different from the first type, the instructions to search the subject for an instance of the given character string comprise instructions to perform a Boyer-Moore search for the given character string in the subject, and the instructions to determine that there is an instance of the given character string in the subject comprise instructions to determine from the Boyer-Moore that an instance of the given character string is present in the subject.